---
layout: model
title: English asr_wav2vec2_large_xls_r_thai_test TFWav2Vec2ForCTC from juierror
author: John Snow Labs
name: asr_wav2vec2_large_xls_r_thai_test
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_thai_test` is an English model originally trained by juierror.
NOTE: This model only works on a CPU. If you need to run it on a GPU device, please use asr_wav2vec2_large_xls_r_thai_test_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_thai_test_en_4.2.0_3.0_1664024143489.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_thai_test_en_4.2.0_3.0_1664024143489.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xls_r_thai_test", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xls_r_thai_test", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xls_r_thai_test|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Detect Nationality / Company Founding Places in texts
author: John Snow Labs
name: finner_wiki_nationality
date: 2023-01-15
tags: [en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is an NER model aimed at detecting nationalities, more specifically the nationality associated with a company's founding place. It was trained on Wikipedia texts about companies.
## Predicted Entities
`NATIONALITY`, `O`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_wiki_nationality_en_1.0.0_3.0_1673797584937.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_wiki_nationality_en_1.0.0_3.0_1673797584937.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
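This card does not include a usage snippet, so here is a minimal sketch following the pattern of sibling Finance NLP NER cards. Only the model name and its input/output labels (`[sentence, token, embeddings]` → `[ner]`) come from the Model Information table below; the `bert_embeddings_sec_bert_base` embeddings stage and the example sentence are assumptions, so substitute the embeddings the model was actually trained with if they differ.

```python
# Sketch only: a standard Finance NLP NER pipeline (requires the licensed
# johnsnowlabs library and an active Spark session).
from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Assumed embeddings stage; the card does not state which embeddings were used.
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_wiki_nationality", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages=[document_assembler, sentence_detector, tokenizer,
                                embeddings, ner_model, ner_converter])

# Hypothetical example sentence.
data = spark.createDataFrame([["Amazon is an American multinational technology company."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```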
## Results
```bash
+-----------+--------+---+-----------+
|sentence_id|chunk |end|ner_label |
+-----------+--------+---+-----------+
|0 |American|73 |NATIONALITY|
+-----------+--------+---+-----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finner_wiki_nationality|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|1.2 MB|
## References
Wikipedia
## Benchmarking
```bash
label tp fp fn prec rec f1
B-NATIONALITY 57 7 1 0.890625 0.98275864 0.93442625
Macro-average 57 7 1 0.890625 0.98275864 0.93442625
Micro-average 57 7 1 0.890625 0.98275864 0.93442625
```
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_10_h_512
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-10_H-512` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_512_zh_4.2.4_3.0_1670325743888.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_512_zh_4.2.4_3.0_1670325743888.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_512","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_512","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_10_h_512|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|161.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-10_H-512
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://cloud.tencent.com/
---
layout: model
title: Translate English to Bantu languages Pipeline
author: John Snow Labs
name: translate_en_bnt
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, bnt, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `bnt`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_bnt_xx_2.7.0_2.4_1609687439138.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_bnt_xx_2.7.0_2.4_1609687439138.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_bnt", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_bnt", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.bnt').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_bnt|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from Gantenbein)
author: John Snow Labs
name: roberta_qa_addi_fr_xlm_r
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ADDI-FR-XLM-R` is an English model originally trained by `Gantenbein`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_addi_fr_xlm_r_en_4.3.0_3.0_1674207724209.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_addi_fr_xlm_r_en_4.3.0_3.0_1674207724209.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_addi_fr_xlm_r","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_addi_fr_xlm_r","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_addi_fr_xlm_r|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|422.7 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Gantenbein/ADDI-FR-XLM-R
---
layout: model
title: Entity Recognition Pipeline (Large, Spanish)
author: John Snow Labs
name: entity_recognizer_lg
date: 2022-06-25
tags: [es, open_source]
task: Named Entity Recognition
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The entity_recognizer_lg is a pretrained pipeline that performs basic text processing steps and recognizes entities, covering most of the common text processing tasks you would run on a DataFrame.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_es_4.0.0_3.0_1656126065101.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_es_4.0.0_3.0_1656126065101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("entity_recognizer_lg", "es")
result = pipeline.annotate("""I love johnsnowlabs! """)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.ner.lg").predict("""I love johnsnowlabs! """)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|entity_recognizer_lg|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|es|
|Size:|2.5 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- NerDLModel
- NerConverter
---
layout: model
title: Relation Extraction between Posologic entities
author: John Snow Labs
name: posology_re
date: 2020-09-01
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 2.5.5
spark_version: 2.4
tags: [re, en, clinical, licensed, relation extraction]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts relations between posology-related terminology.
## Predicted Entities
`DRUG-DOSAGE`, `DRUG-FREQUENCY`, `DRUG-ADE`, `DRUG-FORM`, `ENDED_BY`, `DRUG-ROUTE`, `DRUG-DURATION`, `DRUG-REASON`, `DRUG-STRENGTH`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_POSOLOGY/){:.button.button-orange.button-orange-trans.co.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
## How to use
{% include programmingLanguageSelectScalaPython.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentences")
tokenizer = Tokenizer() \
.setInputCols(["sentences"]) \
.setOutputCol("tokens")
words_embedder = WordEmbeddingsModel()\
.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")
pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
ner_tagger = MedicalNerModel()\
.pretrained("ner_posology", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_chunker = NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ner_tags"])\
.setOutputCol("ner_chunks")
dependency_parser = DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentences", "pos_tags", "tokens"])\
.setOutputCol("dependencies")
reModel = RelationExtractionModel()\
.pretrained("posology_re")\
.setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
.setOutputCol("relations")\
.setMaxSyntacticDistance(4)
pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_chunker, dependency_parser, reModel])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(empty_data)
light_pipeline = LightPipeline(model)
result = light_pipeline.fullAnnotate("The patient was prescribed 1 unit of Advil for 5 days after meals. The patient was also given 1 unit of Metformin daily. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.")
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val words_embedder = WordEmbeddingsModel()
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val pos_tagger = PerceptronModel()
.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val ner_tagger = MedicalNerModel()
.pretrained("ner_posology", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_chunker = new NerConverterInternal()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel()
.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
val re_Model = RelationExtractionModel()
.pretrained("posology_re")
.setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies"))
.setOutputCol("relations")
.setMaxSyntacticDistance(4)
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_Model))
val data = Seq("The patient was prescribed 1 unit of Advil for 5 days after meals. The patient was also given 1 unit of Metformin daily. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
```bash
| relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|----------------|----------|---------------|-------------|------------------|-----------|---------------|-------------|------------------|------------|
| DURATION-DRUG | DURATION | 493 | 500 | five-day | DRUG | 512 | 522 | amoxicillin | 1.0 |
| DRUG-DURATION | DRUG | 681 | 693 | dapagliflozin | DURATION | 695 | 708 | for six months | 1.0 |
| DRUG-ROUTE | DRUG | 1940 | 1946 | insulin | ROUTE | 1948 | 1951 | drip | 1.0 |
| DOSAGE-DRUG | DOSAGE | 2255 | 2262 | 40 units | DRUG | 2267 | 2282 | insulin glargine | 1.0 |
| DRUG-FREQUENCY | DRUG | 2267 | 2282 | insulin glargine | FREQUENCY | 2284 | 2291 | at night | 1.0 |
| DOSAGE-DRUG | DOSAGE | 2295 | 2302 | 12 units | DRUG | 2307 | 2320 | insulin lispro | 1.0 |
| DRUG-FREQUENCY | DRUG | 2307 | 2320 | insulin lispro | FREQUENCY | 2322 | 2331 | with meals | 1.0 |
| DRUG-STRENGTH | DRUG | 2339 | 2347 | metformin | STRENGTH | 2349 | 2355 | 1000 mg | 1.0 |
| DRUG-FREQUENCY | DRUG | 2339 | 2347 | metformin | FREQUENCY | 2357 | 2371 | two times a day | 1.0 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|posology_re|
|Compatibility:|Healthcare NLP 2.5.5+|
|Edition:|Official|
|License:|Licensed|
|Language:|[en]|
---
layout: model
title: ELECTRA Embeddings(ELECTRA Base)
author: John Snow Labs
name: electra_base_uncased
date: 2020-08-27
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
ELECTRA is a BERT-like model that is pre-trained as a discriminator in a set-up resembling a generative adversarial network (GAN). It was originally published by Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning: [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/forum?id=r1xMH1BtvB), ICLR 2020.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_base_uncased_en_2.6.0_2.4_1598485481403.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_base_uncased_en_2.6.0_2.4_1598485481403.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("electra_base_uncased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("electra_base_uncased", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.electra.base_uncased').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_electra_base_uncased_embeddings
I [-0.5244714021682739, -0.0994749441742897, 0.2...
love [-0.14990234375, -0.45483139157295227, 0.28477...
NLP [-0.030217083171010017, -0.43060103058815, -0....
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_base_uncased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|768|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from https://tfhub.dev/google/electra_base/2
---
layout: model
title: English asr_20220507_122935 TFWav2Vec2ForCTC from lilitket
author: John Snow Labs
name: pipeline_asr_20220507_122935
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_20220507_122935` is an English model originally trained by lilitket.
NOTE: This pipeline only works on a CPU. If you need to run it on a GPU device, please use pipeline_asr_20220507_122935_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_20220507_122935_en_4.2.0_3.0_1664117535687.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_20220507_122935_en_4.2.0_3.0_1664117535687.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_20220507_122935', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_20220507_122935", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_20220507_122935|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Smaller BERT Sentence Embeddings (L-4_H-256_A-4)
author: John Snow Labs
name: sent_small_bert_L4_256
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L4_256_en_2.6.0_2.4_1598350389644.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L4_256_en_2.6.0_2.4_1598350389644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L4_256", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I hate cancer', "Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L4_256", "en")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.small_bert_L4_256').predict(text, output_level='sentence')
embeddings_df
```
{:.h2_title}
## Results
```bash
sentence en_embed_sentence_small_bert_L4_256_embeddings
I hate cancer [-0.13163965940475464, 0.5425440073013306, 0.6...
Antibiotics aren't painkiller [-0.4377692639827728, 0.5017094016075134, 0.42...
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_small_bert_L4_256|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|[en]|
|Dimension:|256|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1
---
layout: model
title: Fast Neural Machine Translation Model from Congo Swahili to English
author: John Snow Labs
name: opus_mt_swc_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, swc, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `swc`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_swc_en_xx_2.7.0_2.4_1609162361398.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_swc_en_xx_2.7.0_2.4_1609162361398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_swc_en", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_swc_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.swc.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_swc_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Translate Japanese to English Pipeline
author: John Snow Labs
name: translate_jap_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, jap, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `jap`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_jap_en_xx_2.7.0_2.4_1609688530036.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_jap_en_xx_2.7.0_2.4_1609688530036.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_jap_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_jap_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.jap.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_jap_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering model (from tabo)
author: John Snow Labs
name: distilbert_qa_tabo_base_uncased_finetuned_squad2
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad2` is an English model originally trained by `tabo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_tabo_base_uncased_finetuned_squad2_en_4.0.0_3.0_1654726880025.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_tabo_base_uncased_finetuned_squad2_en_4.0.0_3.0_1654726880025.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_tabo_base_uncased_finetuned_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_tabo_base_uncased_finetuned_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.distil_bert.base_uncased.by_tabo").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
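The NLU one-liner above packs the question and its context into a single string separated by `|||`. A tiny helper (hypothetical name, not part of the nlu package) makes that convention explicit:

```python
def qa_input(question: str, context: str) -> str:
    """Join a question and its context with the '|||' separator
    expected by nlu question-answering predict() calls."""
    return f"{question}|||{context}"

pair = qa_input("What is my name?", "My name is Clara and I live in Berkeley.")
print(pair)  # What is my name?|||My name is Clara and I live in Berkeley.
```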
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_tabo_base_uncased_finetuned_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/tabo/distilbert-base-uncased-finetuned-squad2
---
layout: model
title: Typo Detector
author: John Snow Labs
name: distilbert_token_classifier_typo_detector
date: 2022-01-19
tags: [typo, distilbert, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.3.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model was imported from `Hugging Face` ([link](https://huggingface.co/m3hrdadfi/typo-detector-distilbert-en)) and has been trained on the NeuSpell corpus to detect typos, leveraging `DistilBERT` embeddings and `DistilBertForTokenClassification` for NER. It classifies typo tokens as `PO`.
## Predicted Entities
`PO`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_typo_detector_en_3.3.4_3.0_1642581005021.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_typo_detector_en_3.3.4_3.0_1642581005021.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_typo_detector", "en")\
.setInputCols(["sentence",'token'])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])
text = """He had also stgruggled with addiction during his tine in Congress."""
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_typo_detector", "en")
.setInputCols(Array("sentence","token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))
val example = Seq("He had also stgruggled with addiction during his tine in Congress.").toDS.toDF("text")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.typos.distilbert").predict("""He had also stgruggled with addiction during his tine in Congress.""")
```
## Results
```bash
+------------+---------+
|chunk |ner_label|
+------------+---------+
|stgruggled |PO |
|tine |PO |
+------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_typo_detector|
|Compatibility:|Spark NLP 3.3.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|244.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## Data Source
[https://github.com/neuspell/neuspell](https://github.com/neuspell/neuspell)
## Benchmarking
```bash
label precision recall f1-score support
micro-avg 0.992332 0.985997 0.989154 416054.0
macro-avg 0.992332 0.985997 0.989154 416054.0
weighted-avg 0.992332 0.985997 0.989154 416054.0
```
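As a quick sanity check on the table above, the f1-score is the harmonic mean of precision and recall:

```python
# Recompute f1 from the precision and recall reported in the table above.
precision = 0.992332
recall = 0.985997
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))  # 0.989154
```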
---
layout: model
title: Bulgarian BertForMaskedLM Base Cased model (from Geotrend)
author: John Snow Labs
name: bert_embeddings_base_bg_cased
date: 2022-12-02
tags: [bg, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: bg
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-bg-cased` is a Bulgarian model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_bg_cased_bg_4.2.4_3.0_1670016244564.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_bg_cased_bg_4.2.4_3.0_1670016244564.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_bg_cased","bg") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_bg_cased","bg")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_bg_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|bg|
|Size:|358.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/bert-base-bg-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Clinical QA BioGPT (JSL - conditions)
author: John Snow Labs
name: biogpt_chat_jsl_conditions
date: 2023-05-11
tags: [en, licensed, clinical, tensorflow]
task: Text Generation
language: en
edition: Healthcare NLP 4.4.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalTextGenerator
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is based on BioGPT, fine-tuned with questions related to various medical conditions. It is less conversational and more Q&A focused.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/biogpt_chat_jsl_conditions_en_4.4.0_3.0_1683778577103.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/biogpt_chat_jsl_conditions_en_4.4.0_3.0_1683778577103.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("documents")
gpt_qa = MedicalTextGenerator.pretrained("biogpt_chat_jsl_conditions", "en", "clinical/models")\
.setInputCols("documents")\
.setOutputCol("answer").setMaxNewTokens(100)
pipeline = Pipeline().setStages([document_assembler, gpt_qa])
data = spark.createDataFrame([["How to treat asthma ?"]]).toDF("text")
pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("documents")
val gpt_qa = MedicalTextGenerator.pretrained("biogpt_chat_jsl_conditions", "en", "clinical/models")
.setInputCols("documents")
.setOutputCol("answer").setMaxNewTokens(100)
val pipeline = new Pipeline().setStages(Array(document_assembler, gpt_qa))
val text = "How to treat asthma ?"
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
question: How to treat asthma?. answer: The main treatments for asthma are reliever inhalers, which are small handheld devices that you put into your mouth or nose to help you breathe quickly, and preventer inhaler, a soft mist inhaler that lets you use your inhaler as often as you like. If you have severe asthma, your doctor may prescribe a long-acting bronchodilator, such as salmeterol or vilanterol, or a steroid inhaler. You'll usually need to take both types of inhaler at the same time.
```
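The generator returns the prompt and completion as one string in the form `question: … answer: …`; a small helper (hypothetical name, not part of Spark NLP) splits that back apart:

```python
def split_qa(generated: str):
    """Split a 'question: ... answer: ...' generation into its two parts."""
    q_part, _, a_part = generated.partition(" answer: ")
    return q_part.removeprefix("question: ").strip(), a_part.strip()

q, a = split_qa("question: How to treat asthma?. answer: The main treatments for asthma are reliever inhalers.")
print(q)  # How to treat asthma?.
```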
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|biogpt_chat_jsl_conditions|
|Compatibility:|Healthcare NLP 4.4.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.4 GB|
|Case sensitive:|true|
---
layout: model
title: Translate English to Setswana Pipeline
author: John Snow Labs
name: translate_en_tn
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, tn, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `tn`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_tn_xx_2.7.0_2.4_1609690891054.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_tn_xx_2.7.0_2.4_1609690891054.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_tn", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_tn", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.tn').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_tn|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Explain Document pipeline for Spanish (explain_document_lg)
author: John Snow Labs
name: explain_document_lg
date: 2021-03-23
tags: [open_source, spanish, explain_document_lg, pipeline, es]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: es
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The explain_document_lg is a pretrained pipeline that processes text with a set of basic processing steps and recognizes entities.
It performs most of the common text processing tasks on your dataframe.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_es_3.0.0_3.0_1616497458202.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_lg_es_3.0.0_3.0_1616497458202.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('explain_document_lg', lang = 'es')
annotations = pipeline.fullAnnotate("Hola de John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_lg", lang = "es")
val result = pipeline.fullAnnotate("Hola de John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hola de John Snow Labs! "]
result_df = nlu.load('es.explain.lg').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | lemma | pos | embeddings | ner | entities |
|---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
| 0 | ['Hola de John Snow Labs! '] | ['Hola de John Snow Labs!'] | ['Hola', 'de', 'John', 'Snow', 'Labs!'] | ['Hola', 'de', 'John', 'Snow', 'Labs!'] | ['PART', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[-0.016199000179767,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```
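The `entities` column above comes from grouping consecutive `B-PER`/`I-PER` tags into chunks. A minimal plain-Python sketch of that grouping (a hypothetical stand-in for the pipeline's internal converter, not its actual implementation):

```python
def bio_chunks(tokens, tags):
    """Group consecutive B-/I- tagged tokens into entity chunks."""
    chunks, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(" ".join(current))
            current = [tok]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ['Hola', 'de', 'John', 'Snow', 'Labs!']
tags = ['O', 'O', 'B-PER', 'I-PER', 'I-PER']
print(bio_chunks(tokens, tags))  # ['John Snow Labs!']
```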
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_document_lg|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|es|
---
layout: model
title: Mapping Entities with Corresponding RxNorm Codes
author: John Snow Labs
name: rxnorm_mapper
date: 2022-06-07
tags: [en, rxnorm, licensed, chunk_mapper]
task: Chunk Mapping
language: en
nav_key: models
edition: Healthcare NLP 3.5.0
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained model maps entities to their corresponding RxNorm codes.
## Predicted Entities
`rxnorm_code`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_mapper_en_3.5.0_3.0_1654614618628.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_mapper_en_3.5.0_3.0_1654614618628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
posology_ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("posology_ner")
posology_ner_converter = NerConverterInternal()\
.setInputCols("sentence", "token", "posology_ner")\
.setOutputCol("ner_chunk")
chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")\
.setRel("rxnorm_code")
mapper_pipeline = Pipeline().setStages([
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
posology_ner_model,
posology_ner_converter,
chunkerMapper])
data = spark.createDataFrame([["The patient was given Zyrtec 10 MG, Adapin 10 MG Oral Capsule, Septi-Soothe 0.5 Topical Spray"]]).toDF("text")
result = mapper_pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val posology_ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("posology_ner")
val posology_ner_converter = new NerConverterInternal()
.setInputCols("sentence", "token", "posology_ner")
.setOutputCol("ner_chunk")
val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("mappings")
.setRel("rxnorm_code")
val mapper_pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
posology_ner_model,
posology_ner_converter,
chunkerMapper))
val sentence = "The patient was given Zyrtec 10 MG, Adapin 10 MG Oral Capsule, Septi-Soothe 0.5 Topical Spray"
val data = Seq(sentence).toDF("text")
val result = mapper_pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.rxnorm_resolver").predict("""The patient was given Zyrtec 10 MG, Adapin 10 MG Oral Capsule, Septi-Soothe 0.5 Topical Spray""")
```
## Results
```bash
+------------------------------+---------------+
|chunk |rxnorm_mappings|
+------------------------------+---------------+
|Zyrtec 10 MG |1011483 |
|Adapin 10 MG Oral Capsule |1000050 |
|Septi-Soothe 0.5 Topical Spray|1000046 |
+------------------------------+---------------+
```
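In effect, the mapper acts as a lookup from recognized drug chunks to RxNorm codes; the table above can be mirrored as a plain dictionary (codes copied from the results, purely for illustration):

```python
# Illustrative chunk-to-code lookup mirroring the results table above.
rxnorm_by_chunk = {
    "Zyrtec 10 MG": "1011483",
    "Adapin 10 MG Oral Capsule": "1000050",
    "Septi-Soothe 0.5 Topical Spray": "1000046",
}
print(rxnorm_by_chunk["Adapin 10 MG Oral Capsule"])  # 1000050
```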
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|rxnorm_mapper|
|Compatibility:|Healthcare NLP 3.5.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[posology_ner_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|2.3 MB|
---
layout: model
title: English BertForQuestionAnswering model (from Ghost1)
author: John Snow Labs
name: bert_qa_bert_finetuned_squad1
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad1` is an English model originally trained by `Ghost1`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_squad1_en_4.0.0_3.0_1654535985704.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_squad1_en_4.0.0_3.0_1654535985704.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_finetuned_squad1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_finetuned_squad1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.by_Ghost1").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_finetuned_squad1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Ghost1/bert-finetuned-squad1
---
layout: model
title: Swedish asr_wav2vec2_large_voxrex_swedish_4gram TFWav2Vec2ForCTC from viktor-enzell
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_voxrex_swedish_4gram
date: 2022-09-25
tags: [wav2vec2, sv, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: sv
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_voxrex_swedish_4gram` is a Swedish model originally trained by viktor-enzell.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_voxrex_swedish_4gram_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_voxrex_swedish_4gram_sv_4.2.0_3.0_1664113986284.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_voxrex_swedish_4gram_sv_4.2.0_3.0_1664113986284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_voxrex_swedish_4gram', lang = 'sv')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_voxrex_swedish_4gram", lang = "sv")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_voxrex_swedish_4gram|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|sv|
|Size:|757.4 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Greek BertForMaskedLM Base Uncased model (from gealexandri)
author: John Snow Labs
name: bert_embeddings_greeksocial_base_greek_uncased_v1
date: 2022-12-02
tags: [el, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: el
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `greeksocialbert-base-greek-uncased-v1` is a Greek model originally trained by `gealexandri`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_greeksocial_base_greek_uncased_v1_el_4.2.4_3.0_1670022274783.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_greeksocial_base_greek_uncased_v1_el_4.2.4_3.0_1670022274783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_greeksocial_base_greek_uncased_v1","el") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_greeksocial_base_greek_uncased_v1","el")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_greeksocial_base_greek_uncased_v1|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|el|
|Size:|424.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/gealexandri/greeksocialbert-base-greek-uncased-v1
- http://www.paloservices.com/
---
layout: model
title: Extract Effective, Renewal, Termination dates (Small)
author: John Snow Labs
name: legner_dates_sm
date: 2022-11-21
tags: [renewal, effective, termination, date, en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts whether a date is an Effective Date, a Renewal Date, or a Termination Date, and also extracts the surrounding keywords that may indicate what kind of date it is. Note that the keywords were not used to learn the dates; all entities were trained separately. You can, however, use the keywords to double-check that the extracted date type is correct.
## Predicted Entities
`EFFDATE`, `EFFDATE_KEYWORD`, `RENDATE`, `RENDATE_KEYWORD`, `TERMINDATE`, `TERMINDATE_KEYWORD`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_dates_sm_en_1.0.0_3.0_1669028480461.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_dates_sm_en_1.0.0_3.0_1669028480461.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = legal.NerModel.pretrained('legner_dates_sm', 'en', 'legal/models')\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""RENEWAL DATE. The date on which this Agreement shall renew, July 1st, pursuant to the terms and conditions contained herein."""]
res = model.transform(spark.createDataFrame([text]).toDF("text"))
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_supriyaarun_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_supriyaarun_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_supriyaarun_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/SupriyaArun/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Legal Public Finance And Budget Policy Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_public_finance_and_budget_policy_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, public_finance_and_budget_policy, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the `legclf_public_finance_and_budget_policy_bert` model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the class Public_Finance_and_Budget_Policy or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`Public_Finance_and_Budget_Policy`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_public_finance_and_budget_policy_bert_en_1.0.0_3.0_1678111827097.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_public_finance_and_budget_policy_bert_en_1.0.0_3.0_1678111827097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
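This card does not include a usage snippet. The sketch below follows the pattern used by other Legal NLP document classifier cards; the sentence embeddings model name (`sent_bert_base_cased`) is an assumption and should be replaced by the embeddings this classifier was actually trained with.

```python
# Hedged sketch: assumes a Spark session started with Spark NLP for Legal
# (e.g. spark = nlp.start()); the embeddings model is an assumption,
# not confirmed by this card.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_public_finance_and_budget_policy_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR LEGAL TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```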
## Results
```bash
+----------------------------------+
|result                            |
+----------------------------------+
|[Public_Finance_and_Budget_Policy]|
|[Other]                           |
|[Other]                           |
|[Public_Finance_and_Budget_Policy]|
+----------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_public_finance_and_budget_policy_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.4 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 0.91 0.86 0.89 37
Public_Finance_and_Budget_Policy 0.86 0.91 0.88 33
accuracy - - 0.89 70
macro-avg 0.89 0.89 0.89 70
weighted-avg 0.89 0.89 0.89 70
```
---
layout: model
title: Clinical Deidentification
author: John Snow Labs
name: clinical_deidentification
date: 2021-05-27
tags: [deidentification, en, licensed, pipeline]
task: De-identification
language: en
nav_key: models
edition: Healthcare NLP 3.0.3
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR` entities.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_en_3.0.3_3.0_1622141991699.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_en_3.0.3_3.0_1622141991699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models")
deid_pipeline.annotate("Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val deid_pipeline = PretrainedPipeline("clinical_deidentification","en","clinical/models")
val result = deid_pipeline.annotate("Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.de_identify.clinical_pipeline").predict("""Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.""")
```
## Results
```bash
{'sentence': ['Record date : 2093-01-13, David Hale, M.D.',
'IP: 203.120.223.13.',
"The driver's license no:A334455B.",
'the SSN:324598674 and e-mail: hale@gmail.com.',
'Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93.',
'PCP : Oliveira, 25 years-old.',
"Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286."],
'masked': ['Record date : , , M.D.',
'IP: .',
"The driver's license .",
'the and e-mail: .',
'Name : MR. # Date : .',
'PCP : , years-old.',
"Record date : , Patient's VIN : ."],
'obfuscated': ['Record date : 2093-02-13, Shella Solan, M.D.',
'IP: 333.333.333.333.',
"The driver's license O497302436569.",
'the SSN-539-29-1060 and e-mail: Keith@google.com.',
'Name : Cindy Nakai MR. # I7396944 Date : 06-11-1985.',
'PCP : Benigno Paganini, 3 years-old.',
"Record date : 2079-12-30, Patient's VIN : 5eeee44ffff555666."],
'ner_chunk': ['2093-01-13',
'David Hale',
'no:A334455B',
'SSN:324598674',
'Hendrickson, Ora',
'719435',
'01/13/93',
'Oliveira',
'25',
'2079-11-09',
'1HGBH41JXMN109286']}
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|clinical_deidentification|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.0.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
## Included Models
- DocumentAssembler
- TokenizerModel
- LemmatizerModel
- Finisher
---
layout: model
title: Legal Dividends Clause Binary Classifier
author: John Snow Labs
name: legclf_dividends_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `dividends` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `dividends`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_dividends_clause_en_1.0.0_3.2_1660123433871.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_dividends_clause_en_1.0.0_3.2_1660123433871.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
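No usage snippet is provided on this card. The sketch below follows the pattern used by other Legal NLP clause classifier cards; the sentence embeddings model name (`sent_bert_base_cased`) is an assumption, so substitute the embeddings this classifier was trained with.

```python
# Hedged sketch: assumes a Spark session started with Spark NLP for Legal
# (e.g. spark = nlp.start()); the embeddings model is an assumption,
# not confirmed by this card.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_dividends_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```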
## Results
```bash
+-----------+
|result     |
+-----------+
|[dividends]|
|[other]    |
|[other]    |
|[dividends]|
+-----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_dividends_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
dividends 0.93 0.96 0.95 28
other 0.98 0.97 0.98 64
accuracy - - 0.97 92
macro-avg 0.96 0.97 0.96 92
weighted-avg 0.97 0.97 0.97 92
```
---
layout: model
title: English RobertaForQuestionAnswering (from tli8hf)
author: John Snow Labs
name: roberta_qa_unqover_roberta_large_newsqa
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unqover-roberta-large-newsqa` is an English model originally trained by `tli8hf`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_roberta_large_newsqa_en_4.0.0_3.0_1655740252612.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_roberta_large_newsqa_en_4.0.0_3.0_1655740252612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_unqover_roberta_large_newsqa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_unqover_roberta_large_newsqa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.news.roberta.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_unqover_roberta_large_newsqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/tli8hf/unqover-roberta-large-newsqa
---
layout: model
title: Sentence Entity Resolver for Clinical Abbreviations and Acronyms (sbiobert_base_cased_mli embeddings)
author: John Snow Labs
name: sbiobertresolve_clinical_abbreviation_acronym
date: 2021-12-11
tags: [abbreviation, entity_resolver, licensed, en, clinical, acronym]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.4
spark_version: 2.4
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical abbreviations and acronyms to their meanings using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It is the first primitive version of abbreviation resolution and will be improved further in the following releases.
## Predicted Entities
`Abbreviation Meanings`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_clinical_abbreviation_acronym_en_3.3.4_2.4_1639224244652.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_clinical_abbreviation_acronym_en_3.3.4_2.4_1639224244652.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
c2doc = Chunk2Doc()\
.setInputCols("merged_chunk")\
.setOutputCol("ner_chunk_doc")
sentence_chunk_embeddings = BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
.setInputCols(["document", "merged_chunk"])\
.setOutputCol("sentence_embeddings")\
.setChunkWeight(0.5)
abbr_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_clinical_abbreviation_acronym", "en", "clinical/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("abbr_meaning")\
.setDistanceFunction("EUCLIDEAN")\
.setCaseSensitive(False)
resolver_pipeline = Pipeline(
stages = [
document_assembler,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter_icd,
entity_extractor,
chunk_merge,
c2doc,
sentence_chunk_embeddings,
abbr_resolver
])
model = resolver_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))
sample_text = "HISTORY OF PRESENT ILLNESS: The patient three weeks ago was seen at another clinic for upper respiratory infection-type symptoms. She was diagnosed with a viral infection and had used OTC medications including Tylenol, Sudafed, and Nyquil."
abbr_result = model.transform(spark.createDataFrame([[sample_text]]).toDF('text'))
```
```scala
...
val c2doc = new Chunk2Doc()
.setInputCols("merged_chunk")
.setOutputCol("ner_chunk_doc")
val sentence_chunk_embeddings = BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
.setInputCols(Array("document", "merged_chunk"))
.setOutputCol("sentence_embeddings")
.setChunkWeight(0.5)
val abbr_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_clinical_abbreviation_acronym", "en", "clinical/models")
.setInputCols(Array("sentence_embeddings"))
.setOutputCol("abbr_meaning")
.setDistanceFunction("EUCLIDEAN")
.setCaseSensitive(false)
val resolver_pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, word_embeddings, clinical_ner, ner_converter_icd, entity_extractor, chunk_merge, c2doc, sentence_chunk_embeddings, abbr_resolver))
val sample_text = Seq("HISTORY OF PRESENT ILLNESS: The patient three weeks ago was seen at another clinic for upper respiratory infection-type symptoms. She was diagnosed with a viral infection and had used OTC medications including Tylenol, Sudafed, and Nyquil.").toDF("text")
val abbr_result = resolver_pipeline.fit(sample_text).transform(sample_text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.clinical_abbreviation_acronym").predict("""HISTORY OF PRESENT ILLNESS: The patient three weeks ago was seen at another clinic for upper respiratory infection-type symptoms. She was diagnosed with a viral infection and had used OTC medications including Tylenol, Sudafed, and Nyquil.""")
```
## Results
```bash
| sent_id | ner_chunk | entity | abbr_meaning | all_k_results | all_k_resolutions |
|----------:|:------------|:---------|:-----------------|:-----------------------------------------------------------------------------------|:---------------------------|
| 0 | OTC | ABBR | over the counter | ['over the counter', 'ornithine transcarbamoylase', 'enteric-coated', 'thyroxine'] | ['OTC', 'OTC', 'EC', 'T4'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_clinical_abbreviation_acronym|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[abbr_meaning]|
|Language:|en|
|Size:|104.9 MB|
|Case sensitive:|false|
---
layout: model
title: DistilBERT base model (uncased)
author: John Snow Labs
name: distilbert_base_uncased
date: 2021-05-20
tags: [distilbert, en, english, embeddings, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a distilled version of the [BERT base model](https://huggingface.co/bert-base-cased). It was introduced in [this paper](https://arxiv.org/abs/1910.01108). The code for the distillation process can be found [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation). This model is uncased: it does not make a difference between english and English.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_base_uncased_en_3.1.0_2.4_1621522159616.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_base_uncased_en_3.1.0_2.4_1621522159616.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
embeddings = DistilBertEmbeddings.pretrained("distilbert_base_uncased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
```
```scala
val embeddings = DistilBertEmbeddings.pretrained("distilbert_base_uncased", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.distilbert.base.uncased").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_base_uncased|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[embeddings]|
|Language:|en|
|Case sensitive:|true|
## Data Source
[https://huggingface.co/distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased)
## Benchmarking
```bash
When fine-tuned on downstream tasks, this model achieves the following results:
Glue test results:
| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
| | 82.2 | 88.5 | 89.2 | 91.3 | 51.3 | 85.8 | 87.5 | 59.9 |
```
---
layout: model
title: Legal Liens Clause Binary Classifier
author: John Snow Labs
name: legclf_liens_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `liens` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `liens`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_liens_clause_en_1.0.0_3.2_1660122624426.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_liens_clause_en_1.0.0_3.2_1660122624426.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
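No usage snippet is provided on this card. The sketch below follows the pattern used by other Legal NLP clause classifier cards; the sentence embeddings model name (`sent_bert_base_cased`) is an assumption, so substitute the embeddings this classifier was trained with.

```python
# Hedged sketch: assumes a Spark session started with Spark NLP for Legal
# (e.g. spark = nlp.start()); the embeddings model is an assumption,
# not confirmed by this card.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_liens_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```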
## Results
```bash
+-------+
| result|
+-------+
|[liens]|
|[other]|
|[other]|
|[liens]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_liens_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
liens 0.95 0.84 0.89 44
other 0.93 0.98 0.96 99
accuracy - - 0.94 143
macro-avg 0.94 0.91 0.92 143
weighted-avg 0.94 0.94 0.94 143
```
---
layout: model
title: English asr_wav2vec2_large_960h_lv60 TFWav2Vec2ForCTC from facebook
author: John Snow Labs
name: asr_wav2vec2_large_960h_lv60
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h_lv60` is an English model originally trained by facebook.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_960h_lv60_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_lv60_en_4.2.0_3.0_1664017360276.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_lv60_en_4.2.0_3.0_1664017360276.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_960h_lv60", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_960h_lv60", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_960h_lv60|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|757.4 MB|
---
layout: model
title: Legal Energy Policy Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_energy_policy_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, energy_policy, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the `legclf_energy_policy_bert` model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the class Energy_Policy or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`Energy_Policy`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_energy_policy_bert_en_1.0.0_3.0_1678111634600.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_energy_policy_bert_en_1.0.0_3.0_1678111634600.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
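This card does not include a usage snippet. The sketch below follows the pattern used by other Legal NLP document classifier cards; the sentence embeddings model name (`sent_bert_base_cased`) is an assumption and should be replaced by the embeddings this classifier was actually trained with.

```python
# Hedged sketch: assumes a Spark session started with Spark NLP for Legal
# (e.g. spark = nlp.start()); the embeddings model is an assumption,
# not confirmed by this card.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_energy_policy_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR LEGAL TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```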
## Results
```bash
+---------------+
|result         |
+---------------+
|[Energy_Policy]|
|[Other]        |
|[Other]        |
|[Energy_Policy]|
+---------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_energy_policy_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.4 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Energy_Policy 0.85 0.91 0.88 57
Other 0.88 0.80 0.84 46
accuracy - - 0.86 103
macro-avg 0.87 0.86 0.86 103
weighted-avg 0.87 0.86 0.86 103
```
---
layout: model
title: Lemmatizer (Catalan, SpacyLookup)
author: John Snow Labs
name: lemma_spacylookup
date: 2022-03-03
tags: [open_source, lemmatizer, ca]
task: Lemmatization
language: ca
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Catalan Lemmatizer is a scalable, production-ready version of the rule-based lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_ca_3.4.1_3.0_1646316619311.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_ca_3.4.1_3.0_1646316619311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","ca") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer])
example = spark.createDataFrame([["No ets millor que jo"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","ca")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer))
val data = Seq("No ets millor que jo").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ca.lemma").predict("""No ets millor que jo""")
```
## Results
```bash
+--------------------------+
|result |
+--------------------------+
|[No, ets, millor, que, jo]|
+--------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma_spacylookup|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|ca|
|Size:|7.0 MB|
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Adrian)
author: John Snow Labs
name: distilbert_qa_adrian_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Adrian`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_adrian_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768221355.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_adrian_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768221355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_adrian_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_adrian_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_adrian_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Adrian/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Zero-shot Legal NER (CUAD, small)
author: John Snow Labs
name: legner_roberta_zeroshot_cuad_small
date: 2023-01-30
tags: [zero, shot, cuad, en, licensed, tensorflow]
task: Named Entity Recognition
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a zero-shot NER model based on RoBERTa, trained on SQuAD and fine-tuned to perform zero-shot NER on the CUAD legal dataset. In order to use it, a specific prompt is required. This is an example of one for extracting PARTIES:
```
"Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract"
```
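The prompts are supplied to the annotator as a plain dictionary mapping each entity label to one or more natural-language questions. A minimal sketch of that structure (only the PARTIES prompt comes from this card; the EFFECTIVE_DATE entry is a hypothetical example of adding a second label):

```python
# Entity definitions in the shape expected by setEntityDefinitions: each label
# maps to a list of prompt strings. EFFECTIVE_DATE below is hypothetical and
# only illustrates how a second entity would be added.
entity_definitions = {
    "PARTIES": [
        'Highlight the parts (if any) of this contract related to "Parties" '
        "that should be reviewed by a lawyer. Details: The two or more parties "
        "who signed the contract"
    ],
    "EFFECTIVE_DATE": [  # hypothetical second entity
        'Highlight the parts (if any) of this contract related to "Effective Date". '
        "Details: The date when the contract becomes effective"
    ],
}

# Every value must be a non-empty list of prompt strings.
assert all(isinstance(v, list) and v for v in entity_definitions.values())
print(sorted(entity_definitions))  # ['EFFECTIVE_DATE', 'PARTIES']
```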
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_roberta_zeroshot_cuad_small_en_1.0.0_3.0_1675089181024.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_roberta_zeroshot_cuad_small_en_1.0.0_3.0_1675089181024.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
zeroshot = nlp.ZeroShotNerModel.pretrained("legner_roberta_zeroshot_cuad_small","en","legal/models")\
.setInputCols(["document", "token"])\
.setOutputCol("zero_shot_ner")\
.setEntityDefinitions(
{
'PARTIES': ['Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract']
})
nerconverter = nlp.NerConverter()\
.setInputCols(["document", "token", "zero_shot_ner"])\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline().setStages([
document_assembler,
tokenizer,
zeroshot,
nerconverter
])
from pyspark.sql import types as T
sample_text = ["""THIS CREDIT AGREEMENT is dated as of April 29, 2010, and is made by and
among P.H. GLATFELTER COMPANY, a Pennsylvania corporation ( the "COMPANY") and
certain of its subsidiaries. Identified on the signature pages hereto (each a
"BORROWER" and collectively, the "BORROWERS"), each of the GUARANTORS (as
hereinafter defined), the LENDERS (as hereinafter defined), PNC BANK, NATIONAL
ASSOCIATION, in its capacity as agent for the Lenders under this Agreement
(hereinafter referred to in such capacity as the "ADMINISTRATIVE AGENT"), and,
for the limited purpose of public identification in trade tables, PNC CAPITAL
MARKETS LLC and CITIZENS BANK OF PENNSYLVANIA, as joint arrangers and joint
bookrunners, and CITIZENS BANK OF PENNSYLVANIA, as syndication agent.""".replace('\n',' ')]
p_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
res = p_model.transform(spark.createDataFrame(sample_text, T.StringType()).toDF("text"))
res.show()
```
## Results
```bash
+-----------------------+---------+
|chunk |ner_label|
+-----------------------+---------+
|P.H. GLATFELTER COMPANY|PARTIES |
+-----------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_roberta_zeroshot_cuad_small|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|449.0 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
SQuAD and CUAD datasets
---
layout: model
title: Smaller BERT Embeddings (L-2_H-768_A-12)
author: John Snow Labs
name: small_bert_L2_768
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L2_768_en_2.6.0_2.4_1598344957042.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L2_768_en_2.6.0_2.4_1598344957042.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("small_bert_L2_768", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("small_bert_L2_768", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.bert.small_L2_768').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_bert_small_L2_768_embeddings
I [-0.2451338768005371, 0.40763044357299805, -0....
love [-0.23793038725852966, -0.07403656840324402, -...
NLP [-0.864113450050354, -0.2902209758758545, 0.54...
```
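Each token above is mapped to a dense 768-dimensional vector, and semantic closeness between tokens is typically measured with cosine similarity. A self-contained sketch, using toy 4-dimensional vectors rather than real model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors for illustration only; the real model produces
# 768-dimensional vectors like the truncated ones shown in the output above.
v_love = [0.9, 0.1, 0.3, 0.2]
v_like = [0.8, 0.2, 0.4, 0.1]
v_nlp  = [0.1, 0.9, 0.0, 0.7]

# Semantically closer words should score higher.
assert cosine_similarity(v_love, v_like) > cosine_similarity(v_love, v_nlp)
```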
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|small_bert_L2_768|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|768|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from [https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1)
---
layout: model
title: Sentence Entity Resolver for Clinical Abbreviations and Acronyms (sbiobert_base_cased_mli embeddings)
author: John Snow Labs
name: sbiobertresolve_clinical_abbreviation_acronym
date: 2022-02-01
tags: [en, entity_resolution, clinical, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.4
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical abbreviations and acronyms to their meanings using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It is an improved version of the base model, trained with more varied data.
## Predicted Entities
`Abbreviation Meanings`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_clinical_abbreviation_acronym_en_3.3.4_3.0_1643681527227.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_clinical_abbreviation_acronym_en_3.3.4_3.0_1643681527227.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["document", "token"])\
.setOutputCol("word_embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") \
.setInputCols(["document", "token", "word_embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["document", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(['ABBR'])
sentence_chunk_embeddings = BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
.setInputCols(["document", "ner_chunk"])\
.setOutputCol("sentence_embeddings")\
.setChunkWeight(0.5)\
.setCaseSensitive(True)
abbr_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_clinical_abbreviation_acronym", "en", "clinical/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("abbr_meaning")\
.setDistanceFunction("EUCLIDEAN")
resolver_pipeline = Pipeline(
stages = [
document_assembler,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
sentence_chunk_embeddings,
abbr_resolver
])
model = resolver_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))
sample_text = "Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."
abbr_result = model.transform(spark.createDataFrame([[sample_text]]).toDF('text'))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("word_embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models")
.setInputCols(Array("document", "token", "word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("ABBR"))
val sentence_chunk_embeddings = BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
.setInputCols(Array("document", "ner_chunk"))
.setOutputCol("sentence_embeddings")
.setChunkWeight(0.5)
.setCaseSensitive(true)
val abbr_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_clinical_abbreviation_acronym", "en", "clinical/models")
.setInputCols(Array("sentence_embeddings"))
.setOutputCol("abbr_meaning")
.setDistanceFunction("EUCLIDEAN")
val resolver_pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, word_embeddings, clinical_ner, ner_converter, sentence_chunk_embeddings, abbr_resolver))
val sample_text = Seq("""Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""").toDS().toDF("text")
val abbr_result = resolver_pipeline.fit(sample_text).transform(sample_text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.clinical_abbreviation_acronym").predict("""Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""")
```
## Results
```bash
| | chunk | abbr_meaning | all_k_results |
|---:|:--------|:-------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 | CBC | Complete Blood Count | Complete Blood Count:::Complete blood count:::blood group in ABO system:::(complement) component 4:::abortion:::carbohydrate antigen:::clear to auscultation:::carcinoembryonic antigen:::cervical (level) 4 |
| 1 | AB | blood group in ABO system | blood group in ABO system:::abortion |
| 2 | VDRL | Venereal disease research laboratory | Venereal disease research laboratory:::venous blood gas:::leukocyte esterase:::vertical banded gastroplasty |
| 3 | HIV | human immunodeficiency virus | human immunodeficiency virus:::blood group in ABO system:::abortion:::fluorescent in situ hybridization |
```
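The `all_k_results` column above packs the candidate meanings into a single `":::"`-delimited string, best match first. A small helper for unpacking it once the result DataFrame has been collected:

```python
def parse_all_k_results(raw, top_k=None):
    """Split the ':::'-delimited candidate string produced by the resolver
    into an ordered list of candidate meanings (best match first)."""
    candidates = raw.split(":::")
    return candidates[:top_k] if top_k else candidates

# Row for the "AB" chunk from the results table above.
raw = "blood group in ABO system:::abortion"
print(parse_all_k_results(raw))           # ['blood group in ABO system', 'abortion']
print(parse_all_k_results(raw, top_k=1))  # ['blood group in ABO system']
```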
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_clinical_abbreviation_acronym|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[output]|
|Language:|en|
|Size:|112.3 MB|
|Case sensitive:|true|
## References
Trained on an in-house curated dataset.
---
layout: model
title: Legal Closing Clause Binary Classifier
author: John Snow Labs
name: legclf_closing_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `closing` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial linked above).
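A minimal whitespace-based chunker illustrating the 512-token budget. This is only a sketch, not the workshop's implementation: real BERT tokenization is subword-based, so a smaller budget in practice leaves a safety margin.

```python
def chunk_by_tokens(text, max_tokens=512):
    """Naively split a document on whitespace into chunks of at most
    max_tokens words. Real BERT tokenization is subword-based, so a
    lower budget (e.g. 400) leaves headroom for subword expansion."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

doc = "word " * 1000  # a 1000-word document
chunks = chunk_by_tokens(doc, max_tokens=512)
print(len(chunks))  # 2
assert all(len(c.split()) <= 512 for c in chunks)
```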
This model can be combined with any of the other hundreds of legal clause classifiers available in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
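Combining several binary clause classifiers amounts to collecting one True/False flag per model. A sketch of merging the per-model predictions into a single map (the per-model labels below are hypothetical example outputs):

```python
# Hypothetical outputs from running two binary clause classifiers on the same
# paragraph; each model returns either its clause name or the fallback class.
predictions = {
    "legclf_closing_clause": "closing",
    "legclf_energy_policy_bert": "Other",
}

def clause_flags(predictions):
    """Map each classifier to True when it predicted its clause type,
    i.e. anything other than the 'other'/'Other' fallback class."""
    return {model: label.lower() != "other" for model, label in predictions.items()}

print(clause_flags(predictions))
# {'legclf_closing_clause': True, 'legclf_energy_policy_bert': False}
```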
## Predicted Entities
`other`, `closing`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_closing_clause_en_1.0.0_3.2_1660123306835.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_closing_clause_en_1.0.0_3.2_1660123306835.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+---------+
|result   |
+---------+
|[closing]|
|[other]  |
|[other]  |
|[closing]|
+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_closing_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
closing 0.94 0.91 0.93 56
other 0.97 0.98 0.97 143
accuracy - - 0.96 199
macro-avg 0.95 0.94 0.95 199
weighted-avg 0.96 0.96 0.96 199
```
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from saraks)
author: John Snow Labs
name: distilbert_qa_cuad_effective_date_08_31_v1
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cuad-distil-effective_date-08-31-v1` is an English model originally trained by `saraks`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_effective_date_08_31_v1_en_4.3.0_3.0_1672766163916.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_effective_date_08_31_v1_en_4.3.0_3.0_1672766163916.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_effective_date_08_31_v1","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_effective_date_08_31_v1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_cuad_effective_date_08_31_v1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/saraks/cuad-distil-effective_date-08-31-v1
---
layout: model
title: Named Entity Recognition (NER) Model in Norwegian (Norne 840B 300)
author: John Snow Labs
name: norne_840B_300
date: 2020-05-06
task: Named Entity Recognition
language: "no"
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [ner, norne, open_source]
supported: true
annotator: NerDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Norne is a Named Entity Recognition (NER) model for Norwegian, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. Norne 840B 300 is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline.
{:.h2_title}
## Predicted Entities
Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Derived-`DRV`, Products-`PROD`, Geo-political Entities Location-`GPE_LOC`, Geo-political Entities Organization-`GPE_ORG`, Events-`EVT`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_NO/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_NO.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/norne_840B_300_no_2.5.0_2.4_1588781290267.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/norne_840B_300_no_2.5.0_2.4_1588781290267.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
ner_model = NerDLModel.pretrained("norne_840B_300", "no") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text'))
result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. [ 9] Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella.']], ["text"]))
```
```scala
...
val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang = "xx")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("norne_840B_300", "no")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. [ 9] Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella."""]
ner_df = nlu.load('no.ner.norne.glove.840B_300').predict(text, output_level = "chunk")
ner_df[["entities", "entities_confidence"]]
```
{:.h2_title}
## Results
```bash
+-------------------------------+---------+
|chunk |ner_label|
+-------------------------------+---------+
|William Henry Gates III |PER |
|Microsoft Corporation |ORG |
|Microsoft |ORG |
|Gates |PER |
|CEO |PER |
|Seattle |GPE_LOC |
|Washington |GPE_LOC |
|Microsoft |ORG |
|Paul Allen |PER |
|Albuquerque |GPE_LOC |
|New Mexico |GPE_LOC |
|Gates |PER |
|Gates |PER |
|Gates |PER |
|Microsoft |ORG |
|Bill & Melinda Gates Foundation|ORG |
|Melinda Gates |PER |
|Ray Ozzie |PER |
|Craig Mundie |PER |
|Microsoft |ORG |
+-------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|norne_840B_300|
|Type:|ner|
|Compatibility:|Spark NLP 2.5.0+|
|Edition:|Official|
|License:|Open Source|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|no|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The detailed information can be found from [https://www.aclweb.org/anthology/2020.lrec-1.559.pdf](https://www.aclweb.org/anthology/2020.lrec-1.559.pdf)
---
layout: model
title: Detect Person, Organization, Location, Facility, Product and Event entities in Persian (persian_w2v_cc_300d)
author: John Snow Labs
name: personer_cc_300d
date: 2020-12-07
task: Named Entity Recognition
language: fa
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [ner, fa, open_source]
supported: true
annotator: NerDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model uses Persian word embeddings to find 6 different types of entities in Persian text. It is trained using `persian_w2v_cc_300d` word embeddings, so please use the same embeddings in the pipeline.
## Predicted Entities
Persons-`PER`, Facilities-`FAC`, Products-`PRO`, Locations-`LOC`, Organizations-`ORG`, Events-`EVENT`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/personer_cc_300d_fa_2.7.0_2.4_1607339059321.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/personer_cc_300d_fa_2.7.0_2.4_1607339059321.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
word_embeddings = WordEmbeddingsModel.pretrained("persian_w2v_cc_300d", "fa") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
ner = NerDLModel.pretrained("personer_cc_300d", "fa") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate("به گزارش خبرنگار ایرنا ، بر اساس تصمیم این مجمع ، محمد قمی نماینده مردم پاکدشت به عنوان رئیس و علیاکبر موسوی خوئینی و شمسالدین وهابی نمایندگان مردم تهران به عنوان نواب رئیس انتخاب شدند")
```
```scala
...
val embeddings = WordEmbeddingsModel.pretrained("persian_w2v_cc_300d", "fa")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("personer_cc_300d", "fa")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter().setInputCols(Array("sentence", "token", "ner")).setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("به گزارش خبرنگار ایرنا ، بر اساس تصمیم این مجمع ، محمد قمی نماینده مردم پاکدشت به عنوان رئیس و علیاکبر موسوی خوئینی و شمسالدین وهابی نمایندگان مردم تهران به عنوان نواب رئیس انتخاب شدند").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("fa.ner").predict("""به گزارش خبرنگار ایرنا ، بر اساس تصمیم این مجمع ، محمد قمی نماینده مردم پاکدشت به عنوان رئیس و علیاکبر موسوی خوئینی و شمسالدین وهابی نمایندگان مردم تهران به عنوان نواب رئیس انتخاب شدند""")
```
## Results
```bash
| | ner_chunk | entity |
|---:|--------------------------:|-------------:|
| 0 | خبرنگار ایرنا | ORG |
| 1 | محمد قمی | PER |
| 2 | پاکدشت | LOC |
| 3 | علیاکبر موسوی خوئینی | PER |
| 4 | شمسالدین وهابی | PER |
| 5 | تهران | LOC |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|personer_cc_300d|
|Type:|ner|
|Compatibility:|Spark NLP 2.7.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token, word_embeddings]|
|Output Labels:|[ner]|
|Language:|fa|
|Dependencies:|persian_w2v_cc_300d|
## Data Source
This model is trained on data provided by [https://www.aclweb.org/anthology/C16-1319/](https://www.aclweb.org/anthology/C16-1319/).
## Benchmarking
```bash
| | label | tp | fp | fn | prec | rec | f1 |
|---:|:--------------|------:|------:|-----:|---------:|---------:|---------:|
| 0 | B-Per | 1035 | 99 | 75 | 0.912698 | 0.932432 | 0.92246 |
| 1 | I-Fac | 239 | 42 | 64 | 0.850534 | 0.788779 | 0.818493 |
| 2 | I-Pro | 173 | 52 | 158 | 0.768889 | 0.522659 | 0.622302 |
| 3 | I-Loc | 221 | 68 | 66 | 0.764706 | 0.770035 | 0.767361 |
| 4 | I-Per | 652 | 38 | 55 | 0.944928 | 0.922207 | 0.933429 |
| 5 | B-Org | 1118 | 289 | 348 | 0.794598 | 0.762619 | 0.778281 |
| 6 | I-Org | 1543 | 237 | 240 | 0.866854 | 0.865395 | 0.866124 |
| 7 | I-Event | 486 | 130 | 108 | 0.788961 | 0.818182 | 0.803306 |
| 8 | B-Loc | 974 | 252 | 168 | 0.794454 | 0.85289 | 0.822635 |
| 9 | B-Fac | 123 | 31 | 44 | 0.798701 | 0.736527 | 0.766355 |
| 10 | B-Pro | 168 | 81 | 97 | 0.674699 | 0.633962 | 0.653697 |
| 11 | B-Event | 126 | 52 | 51 | 0.707865 | 0.711864 | 0.709859 |
| 12 | Macro-average | 6858 | 1371 | 1474 | 0.805657 | 0.776463 | 0.790791 |
| 13 | Micro-average | 6858 | 1371 | 1474 | 0.833394 | 0.823092 | 0.828211 |
```
---
layout: model
title: BERT Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on QNLI
author: John Snow Labs
name: bert_wiki_books_qnli
date: 2021-08-30
tags: [en, open_source, wikipedia_dataset, bert_embeddings, qnli_dataset, books_corpus_dataset]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.2.0
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/wiki_books/1 and fine-tuned on QNLI.
This is a BERT base architecture but some changes have been made to the original training and export scheme based on more recent learnings.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_wiki_books_qnli_en_3.2.0_3.0_1630322335414.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_wiki_books_qnli_en_3.2.0_3.0_1630322335414.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
embeddings = BertEmbeddings.pretrained("bert_wiki_books_qnli", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
```
```scala
val embeddings = BertEmbeddings.pretrained("bert_wiki_books_qnli", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.bert.wiki_books_qnli').predict(text, output_level='token')
embeddings_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_wiki_books_qnli|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Case sensitive:|false|
## Data Source
[1]: [Wikipedia dataset](https://dumps.wikimedia.org/)
[2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb)
[3]: [QNLI dataset](https://gluebenchmark.com/)
This model has been imported from https://tfhub.dev/google/experts/bert/wiki_books/qnli/2
---
layout: model
title: Detect Cellular/Molecular Biology Entities
author: John Snow Labs
name: ner_cellular_en
date: 2020-04-22
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 2.4.2
spark_version: 2.4
tags: [ner, en, clinical, licensed]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for molecular biology related terms. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
## Predicted Entities
`DNA`, `Cell_type`, `Cell_line`, `RNA`, `Protein`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CELLULAR/){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cellular_en_2.4.2_2.4_1587513308751.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cellular_en_2.4.2_2.4_1587513308751.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %}
```python
...
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
cellular_ner = NerDLModel.pretrained("ner_cellular", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, cellular_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([['Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive. ']], ["text"]))
```
```scala
...
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val cellular_ner = NerDLModel.pretrained("ner_cellular", "en", "clinical/models")
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, cellular_ner, ner_converter))
val data = Seq("Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe or add the ``"Finisher"`` to the end of your pipeline.
```bash
+-----------------------------------------------------------+---------+
|chunk                                                      |ner      |
+-----------------------------------------------------------+---------+
|intracellular signaling proteins |protein |
|human T-cell leukemia virus type 1 promoter |DNA |
|Tax |protein |
|Tax-responsive element 1 |DNA |
|cyclic AMP-responsive members |protein |
|CREB/ATF family |protein |
|transcription factors |protein |
|Tax |protein |
|human T-cell leukemia virus type 1 Tax-responsive element 1|DNA |
|TRE-1), |DNA |
|lacZ gene |DNA |
|CYC1 promoter |DNA |
|TRE-1 |DNA |
|cyclic AMP response element-binding protein |protein |
|CREB |protein |
|CREB |protein |
|GAL4 activation domain |protein |
|GAD |protein |
|reporter gene |DNA |
|Tax |protein |
+-----------------------------------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_cellular|
|Type:|ner|
|Compatibility:|Spark NLP 2.4.2+|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Case sensitive:|false|
{:.h2_title}
## Data Source
Trained on the JNLPBA corpus, containing 2,404 publication abstracts, with ``'embeddings_clinical'``.
http://www.geniaproject.org/
{:.h2_title}
## Benchmarking
```bash
| | label | tp | fp | fn | prec | rec | f1 |
|---:|:--------------|-------:|------:|-----:|---------:|---------:|---------:|
| 0 | B-cell_line | 377 | 203 | 123 | 0.65 | 0.754 | 0.698148 |
| 1 | I-DNA | 1519 | 277 | 266 | 0.845768 | 0.85098 | 0.848366 |
| 2 | I-protein | 3981 | 911 | 786 | 0.813778 | 0.835116 | 0.824309 |
| 3 | B-protein | 4483 | 1433 | 579 | 0.757776 | 0.885618 | 0.816724 |
| 4 | I-cell_line | 786 | 340 | 203 | 0.698046 | 0.794742 | 0.743262 |
| 5 | I-RNA | 178 | 42 | 9 | 0.809091 | 0.951872 | 0.874693 |
| 6 | B-RNA | 99 | 28 | 19 | 0.779528 | 0.838983 | 0.808163 |
| 7 | B-cell_type | 1440 | 294 | 480 | 0.83045 | 0.75 | 0.788177 |
| 8 | I-cell_type | 2431 | 377 | 559 | 0.865741 | 0.813044 | 0.838565 |
| 9 | B-DNA | 814 | 267 | 240 | 0.753006 | 0.772296 | 0.762529 |
| 10 | Macro-average | 16108 | 4172 | 3264 | 0.780318 | 0.824665 | 0.801879 |
| 11 | Micro-average | 16108 | 4172 | 3264 | 0.79428 | 0.831509 | 0.812469 |
```
---
layout: model
title: Named Entity Recognition for Japanese (XLM-RoBERTa)
author: John Snow Labs
name: ner_ud_gsd_xlm_roberta_base
date: 2021-09-15
tags: [ja, ner, open_source]
task: Named Entity Recognition
language: ja
edition: Spark NLP 3.2.2
spark_version: 3.0
supported: true
annotator: NerDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model annotates named entities in a text, which can be used to find features such as names of people, places, and organizations. The model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together.
This model uses the pretrained XlmRoBertaEmbeddings embeddings "xlm_roberta_base" as an input, so be sure to use the same embeddings in the pipeline.
## Predicted Entities
`ORDINAL`, `PERSON`, `LAW`, `MOVEMENT`, `LOC`, `WORK_OF_ART`, `DATE`, `NORP`, `TITLE_AFFIX`, `QUANTITY`, `FAC`, `TIME`, `MONEY`, `LANGUAGE`, `GPE`, `EVENT`, `ORG`, `PERCENT`, `PRODUCT`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_ud_gsd_xlm_roberta_base_ja_3.2.2_3.0_1631696644878.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_ud_gsd_xlm_roberta_base_ja_3.2.2_3.0_1631696644878.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.training import *
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = XlmRoBertaEmbeddings.pretrained() \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nerTagger = NerDLModel.pretrained("ner_ud_gsd_xlm_roberta_base", "ja") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
pipeline = Pipeline().setStages(
[
documentAssembler,
sentence,
word_segmenter,
embeddings,
nerTagger,
]
)
data = spark.createDataFrame([["宮本茂氏は、日本の任天堂のゲームプロデューサーです。"]]).toDF("text")
model = pipeline.fit(data)
result = model.transform(data)
result.selectExpr("explode(arrays_zip(token.result, ner.result))").show()
```
```scala
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{NerDLModel, SentenceDetector, WordSegmenterModel}
import com.johnsnowlabs.nlp.embeddings.XlmRoBertaEmbeddings
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja")
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = XlmRoBertaEmbeddings.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val nerTagger = NerDLModel.pretrained("ner_ud_gsd_xlm_roberta_base", "ja")
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
word_segmenter,
embeddings,
nerTagger
))
val data = Seq("宮本茂氏は、日本の任天堂のゲームプロデューサーです。").toDF("text")
val model = pipeline.fit(data)
val result = model.transform(data)
result.selectExpr("explode(arrays_zip(token.result, ner.result))").show()
```
{:.nlu-block}
```python
import nlu
nlu.load("ja.ner.ud_gsd_xlm_roberta_base").predict("""宮本茂氏は、日本の任天堂のゲームプロデューサーです。""")
```
## Results
```bash
+-------------------+
| col|
+-------------------+
| {宮本, B-PERSON}|
| {茂, I-PERSON}|
| {氏, O}|
| {は, O}|
| {、, O}|
| {日本, B-GPE}|
| {の, O}|
| {任天, B-ORG}|
| {堂, I-ORG}|
| {の, O}|
| {ゲーム, O}|
|{プロデューサー, O}|
| {です, O}|
| {。, O}|
+-------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_ud_gsd_xlm_roberta_base|
|Type:|ner|
|Compatibility:|Spark NLP 3.2.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ja|
|Dependencies:|xlm_roberta_base|
## Data Source
The model was trained on the Universal Dependencies Japanese-GSD treebank. An NER-annotated version was created by megagonlabs:
https://github.com/megagonlabs/UD_Japanese-GSD
Reference:
Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC-2018.
## Benchmarking
```bash
label precision recall f1-score support
DATE 0.93 0.97 0.95 206
EVENT 0.78 0.48 0.60 52
FAC 0.80 0.68 0.73 59
GPE 0.88 0.81 0.85 102
LANGUAGE 1.00 1.00 1.00 8
LAW 0.82 0.69 0.75 13
LOC 0.87 0.83 0.85 41
MONEY 1.00 1.00 1.00 20
MOVEMENT 0.67 0.55 0.60 11
NORP 0.84 0.86 0.85 57
O 0.99 0.99 0.99 11785
ORDINAL 0.94 0.94 0.94 32
ORG 0.71 0.78 0.74 179
PERCENT 1.00 1.00 1.00 16
PERSON 0.89 0.90 0.89 127
PRODUCT 0.56 0.68 0.61 50
QUANTITY 0.92 0.96 0.94 172
TIME 0.91 1.00 0.96 32
TITLE_AFFIX 0.86 0.75 0.80 24
WORK_OF_ART 0.87 0.85 0.86 48
accuracy - - 0.98 13034
macro-avg 0.86 0.84 0.85 13034
weighted-avg 0.98 0.98 0.98 13034
```
---
layout: model
title: Catalan RobertaForQuestionAnswering (from thatdramebaazguy)
author: John Snow Labs
name: roberta_qa_roberta_base_squad
date: 2022-06-20
tags: [ca, open_source, question_answering, roberta]
task: Question Answering
language: ca
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad` is a Catalan model originally trained by `thatdramebaazguy`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad_ca_4.0.0_3.0_1655734774977.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad_ca_4.0.0_3.0_1655734774977.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad","ca") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_squad","ca")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|ca|
|Size:|461.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/thatdramebaazguy/roberta-base-squad
---
layout: model
title: Slovenian T5ForConditionalGeneration Small Cased model (from cjvt)
author: John Snow Labs
name: t5_legacy_sl_small
date: 2023-01-30
tags: [sl, open_source, t5, tensorflow]
task: Text Generation
language: sl
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legacy-t5-sl-small` is a Slovenian model originally trained by `cjvt`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_legacy_sl_small_sl_4.3.0_3.0_1675104880094.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_legacy_sl_small_sl_4.3.0_3.0_1675104880094.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_legacy_sl_small","sl") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_legacy_sl_small","sl")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_legacy_sl_small|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|sl|
|Size:|178.9 MB|
## References
- https://huggingface.co/cjvt/legacy-t5-sl-small
---
layout: model
title: Arabic BertForMaskedLM Base Cased model (from Geotrend)
author: John Snow Labs
name: bert_embeddings_base_ar_cased
date: 2022-12-02
tags: [ar, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: ar
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-ar-cased` is an Arabic model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_ar_cased_ar_4.2.4_3.0_1670015694662.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_ar_cased_ar_4.2.4_3.0_1670015694662.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_ar_cased","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_ar_cased","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_ar_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|344.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/bert-base-ar-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: French Legal Roberta Embeddings
author: John Snow Labs
name: roberta_large_french_legal
date: 2023-02-16
tags: [fr, french, embeddings, transformer, open_source, legal, tensorflow]
task: Embeddings
language: fr
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-french-roberta-large` is a French model originally trained by `joelito`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_large_french_legal_fr_4.2.4_3.0_1676556919312.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_large_french_legal_fr_4.2.4_3.0_1676556919312.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_large_french_legal|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|1.3 GB|
|Case sensitive:|true|
## References
https://huggingface.co/joelito/legal-french-roberta-large
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_base_squad2.0
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2.0_en_4.3.0_3.0_1674219848563.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2.0_en_4.3.0_3.0_1674219848563.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2.0","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_squad2.0|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|460.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AnonymousSub/roberta-base_squad2.0
---
layout: model
title: Part of Speech for Breton
author: John Snow Labs
name: pos_ud_keb
date: 2021-03-09
tags: [part_of_speech, open_source, breton, pos_ud_keb, br]
task: Part of Speech Tagging
language: br
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`.
## Predicted Entities
- ADV
- VERB
- PUNCT
- NOUN
- PART
- ADJ
- ADP
- NUM
- DET
- X
- PROPN
- PRON
- CCONJ
- SCONJ
- SYM
- INTJ
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_keb_br_3.0.0_3.0_1615292153000.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_keb_br_3.0.0_3.0_1615292153000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ud_keb", "br") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['Hello from John Snow Labs!']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_keb", "br")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("Hello from John Snow Labs!").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs!"]
token_df = nlu.load('br.pos').predict(text)
token_df
```
## Results
```bash
token pos
0 Hello PRON
1 from VERB
2 John ADJ
3 Snow PROPN
4 Labs PROPN
5 ! PUNCT
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_keb|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|br|
---
layout: model
title: Legal Further assurances Clause Binary Classifier
author: John Snow Labs
name: legclf_further_assurances_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `further-assurances` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a True/False value for each of the legal clause models you have added.
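The splitting recommendation above can be sketched in plain Python, independent of Spark NLP. This is a simplified illustration: the 512-token cap and whitespace tokenization are assumptions for the sketch, since the model's real subword tokenizer counts tokens differently.

```python
def split_document(text, max_tokens=512):
    """Split a long document into paragraph chunks of at most
    max_tokens whitespace-delimited tokens (a rough stand-in for
    the model's 512-token embedding limit)."""
    chunks = []
    # Paragraph splitting by multiline (blank-line) boundaries
    for paragraph in text.split("\n\n"):
        tokens = paragraph.split()
        if not tokens:
            continue
        # Cap each chunk at max_tokens tokens
        for start in range(0, len(tokens), max_tokens):
            chunks.append(" ".join(tokens[start:start + max_tokens]))
    return chunks

doc = "FURTHER ASSURANCES. Each party shall execute further documents.\n\nMISCELLANEOUS. Headings are for convenience only."
print(split_document(doc))
```

Each resulting chunk can then be sent through the classifier pipeline as its own row.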
## Predicted Entities
`other`, `further-assurances`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_further_assurances_clause_en_1.0.0_3.2_1660122474814.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_further_assurances_clause_en_1.0.0_3.2_1660122474814.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
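This card ships no usage snippet. The sketch below follows the pattern of the other Legal NLP classifier cards: a document assembler, a sentence-embeddings stage feeding the `sentence_embeddings` input this model expects, and the classifier itself. The embeddings model name (`sent_bert_base_cased`) and the `legal/models` bucket are assumptions drawn from sibling cards, not confirmed for this model.

```python
# Sketch only -- all stage and model names other than
# "legclf_further_assurances_clause" are assumed from similar cards.
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")
clause_classifier = ClassifierDLModel.pretrained("legclf_further_assurances_clause", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("category")
pipeline = Pipeline(stages=[document_assembler, embeddings, clause_classifier])
data = spark.createDataFrame([["Each party shall execute and deliver such further documents as may be reasonably requested."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```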
## Results
```bash
+--------------------+
|              result|
+--------------------+
|[further-assurances]|
|             [other]|
|             [other]|
|[further-assurances]|
+--------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_further_assurances_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
further-assurances 1.00 1.00 1.00 41
other 1.00 1.00 1.00 99
accuracy - - 1.00 140
macro-avg 1.00 1.00 1.00 140
weighted-avg 1.00 1.00 1.00 140
```
---
layout: model
title: English BertForQuestionAnswering model (from AnonymousSub)
author: John Snow Labs
name: bert_qa_fpdm_hier_bert_FT_newsqa
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_hier_bert_FT_newsqa` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_fpdm_hier_bert_FT_newsqa_en_4.0.0_3.0_1654187869149.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_fpdm_hier_bert_FT_newsqa_en_4.0.0_3.0_1654187869149.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_fpdm_hier_bert_FT_newsqa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_fpdm_hier_bert_FT_newsqa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.news.bert.fpdm_hier_ft.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_fpdm_hier_bert_FT_newsqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/fpdm_hier_bert_FT_newsqa
---
layout: model
title: Russian RoBERTa Embeddings (from blinoff)
author: John Snow Labs
name: roberta_embeddings_roberta_base_russian_v0
date: 2022-04-14
tags: [roberta, embeddings, ru, open_source]
task: Embeddings
language: ru
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-base-russian-v0` is a Russian model originally trained by `blinoff`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_base_russian_v0_ru_3.4.2_3.0_1649947793512.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_base_russian_v0_ru_3.4.2_3.0_1649947793512.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_russian_v0","ru") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Я люблю искра NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_russian_v0","ru")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Я люблю искра NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ru.embed.roberta_base_russian_v0").predict("""Я люблю искра NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_roberta_base_russian_v0|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|ru|
|Size:|468.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/blinoff/roberta-base-russian-v0
---
layout: model
title: Typo Detector Pipeline for Icelandic
author: John Snow Labs
name: distilbert_token_classifier_typo_detector_pipeline
date: 2022-06-25
tags: [icelandic, typo, ner, distilbert, is, open_source]
task: Named Entity Recognition
language: is
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of [distilbert_token_classifier_typo_detector_is](https://nlp.johnsnowlabs.com/2022/01/19/distilbert_token_classifier_typo_detector_is.html).
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/TYPO_DETECTOR_IS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/DistilBertForTokenClassification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_typo_detector_pipeline_is_4.0.0_3.0_1656119193097.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_typo_detector_pipeline_is_4.0.0_3.0_1656119193097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
typo_pipeline = PretrainedPipeline("distilbert_token_classifier_typo_detector_pipeline", lang = "is")
typo_pipeline.annotate("Það er miög auðvelt að draga marktækar álykanir af texta með Spark NLP.")
```
```scala
val typo_pipeline = new PretrainedPipeline("distilbert_token_classifier_typo_detector_pipeline", lang = "is")
typo_pipeline.annotate("Það er miög auðvelt að draga marktækar álykanir af texta með Spark NLP.")
```
## Results
```bash
+--------+---------+
|chunk |ner_label|
+--------+---------+
|miög |PO |
|álykanir|PO |
+--------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_typo_detector_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|is|
|Size:|505.8 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- DistilBertForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: Detect Diagnosis, Symptoms, Drugs, Labs and Demographics (ner_jsl_enriched)
author: John Snow Labs
name: ner_jsl_enriched_en
date: 2020-04-22
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 2.4.2
spark_version: 2.4
tags: [ner, en, clinical, licensed]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for clinical terminology. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN.
Definitions of Predicted Entities:
- `Age`: All mention of ages, past or present, related to the patient or with anybody else.
- `Dosage`: Quantity prescribed by the physician for an active ingredient; measurement units are available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Drug_Name`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients.
- `Frequency`: Frequency of administration for a dose prescribed.
- `Gender`: Gender-specific nouns and pronouns.
- `Symptom`: All the symptoms mentioned in the document, of a patient or someone else.
- `Allergen`: Allergen related extractions mentioned in the document.
- `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted.
- `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in ICD-10 name of a specific disease, the respective modifier is not extracted separately.
- `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements.
- `Procedure`: All mentions of invasive medical or surgical procedures or treatments.
- `Pulse`: Peripheral heart rate, without advanced information like measurement location.
- `Respiration`: Number of breaths per minute.
- `Route`: Drug and medication administration routes available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels).
- `Temperature`: All mentions that refer to body temperature.
- `Weight`: All mentions related to a patient's weight.
## Predicted Entities
`Age`, `Diagnosis`, `Dosage`, `Drug_Name`, `Frequency`, `Gender`, `Lab_Name`, `Lab_Result`, `Symptom_Name`, `Allergenic_substance`, `Blood_Pressure`, `Causative_Agents_(Virus_and_Bacteria)`, `Modifier`, `Name`, `Negation`, `O2_Saturation`, `Procedure`, `Procedure_Name`, `Pulse_Rate`, `Respiratory_Rate`, `Route`, `Section_Name`, `Substance_Name`, `Temperature`, `Weight`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_en_2.4.2_2.4_1587513303751.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_en_2.4.2_2.4_1587513303751.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %}
```python
...
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_jsl_enriched", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."]], ["text"]))
```
```scala
...
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = NerDLModel.pretrained("ner_jsl_enriched", "en", "clinical/models")
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))
val data = Seq("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe or add the ``"Finisher"`` to the end of your pipeline.
```bash
+---------------------------+------------+
|chunk |ner |
+---------------------------+------------+
|21-day-old |Age |
|male |Gender |
|congestion |Symptom_Name|
|mom |Gender |
|suctioning yellow discharge|Symptom_Name|
|she |Gender |
|problems with his breathing|Symptom_Name|
|perioral cyanosis |Symptom_Name|
|retractions |Symptom_Name|
|mom |Gender |
|Tylenol |Drug_Name |
|His |Gender |
|his |Gender |
|respiratory congestion |Symptom_Name|
|He |Gender |
|tired |Symptom_Name|
|fussy |Symptom_Name|
|albuterol |Drug_Name |
+---------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_jsl_enriched|
|Type:|ner|
|Compatibility:|Spark NLP 2.4.2+|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Case sensitive:|false|
{:.h2_title}
## Data Source
Trained on data gathered and manually annotated by John Snow Labs.
https://www.johnsnowlabs.com/data/
{:.h2_title}
## Benchmarking
```bash
label tp fp fn prec rec f1
B-Pulse_Rate 80 26 9 0.754717 0.898876 0.820513
I-Diagnosis 2341 1644 1129 0.587453 0.67464 0.628035
I-Procedure_Name 2209 1128 1085 0.661972 0.670613 0.666265
B-Lab_Result 432 107 263 0.801484 0.621583 0.700162
B-Dosage 465 179 81 0.72205 0.851648 0.781513
I-Causative_Agents_(Virus_and_Bacteria) 9 3 10 0.75 0.473684 0.580645
B-Name 648 295 510 0.687169 0.559585 0.616849
I-Name 917 427 665 0.682292 0.579646 0.626794
B-Weight 52 25 9 0.675325 0.852459 0.753623
B-Symptom_Name 4244 1911 1776 0.689521 0.704983 0.697166
I-Maybe 25 15 63 0.625 0.284091 0.390625
I-Symptom_Name 1920 1584 2503 0.547945 0.434095 0.48442
B-Modifier 1399 704 942 0.66524 0.597608 0.629613
B-Blood_Pressure 82 21 7 0.796117 0.921348 0.854167
B-Frequency 290 93 97 0.75718 0.749354 0.753247
I-Gender 29 19 25 0.604167 0.537037 0.568627
I-Age 3 6 11 0.333333 0.214286 0.26087
B-Drug_Name 1762 500 271 0.778957 0.866699 0.820489
B-Substance_Name 143 32 53 0.817143 0.729592 0.770889
B-Temperature 58 23 11 0.716049 0.84058 0.773333
B-Section_Name 2700 294 177 0.901804 0.938478 0.919775
I-Route 131 165 177 0.442568 0.425325 0.433775
B-Maybe 108 47 164 0.696774 0.397059 0.505855
B-Gender 5156 685 68 0.882726 0.986983 0.931948
I-Dosage 435 182 87 0.705024 0.833333 0.763828
B-Causative_Agents_(Virus_and_Bacteria) 21 17 6 0.552632 0.777778 0.646154
I-Frequency 278 131 191 0.679707 0.592751 0.633257
B-Age 352 34 21 0.911917 0.9437 0.927536
I-Lab_Result 27 20 170 0.574468 0.137056 0.221311
B-Negation 1501 311 341 0.828366 0.814875 0.821565
B-Diagnosis 2657 1281 1049 0.674708 0.716945 0.695186
I-Section_Name 3876 1304 188 0.748263 0.95374 0.838598
B-Route 466 286 123 0.619681 0.791172 0.695004
I-Negation 80 152 190 0.344828 0.296296 0.318725
B-Procedure_Name 1453 739 562 0.662865 0.721092 0.690754
I-Allergenic_substance 6 1 7 0.857143 0.461538 0.6
B-Allergenic_substance 74 31 23 0.704762 0.762887 0.732673
I-Weight 46 43 17 0.516854 0.730159 0.605263
B-Lab_Name 639 189 287 0.771739 0.690065 0.72862
I-Modifier 104 156 417 0.4 0.199616 0.266325
I-Temperature 2 7 13 0.222222 0.133333 0.166667
I-Drug_Name 334 237 290 0.584939 0.535256 0.558996
I-Lab_Name 271 157 140 0.633178 0.659367 0.646007
B-Respiratory_Rate 46 6 5 0.884615 0.901961 0.893204
Macro-average 37896 15237 14343 0.621144 0.562248 0.59023
Micro-average 37896 15237 14343 0.713229 0.725435 0.71928
```
---
layout: model
title: English asr_Part1 TFWav2Vec2ForCTC from zasheza
author: John Snow Labs
name: pipeline_asr_Part1
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Part1` is an English model originally trained by zasheza.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_Part1_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_Part1_en_4.2.0_3.0_1664039779675.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_Part1_en_4.2.0_3.0_1664039779675.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_Part1', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_Part1", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_Part1|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from datarpit)
author: John Snow Labs
name: distilbert_qa_base_uncased_finetuned_natural_questions
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-natural-questions` is an English model originally trained by `datarpit`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_natural_questions_en_4.3.0_3.0_1672768123339.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_natural_questions_en_4.3.0_3.0_1672768123339.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_natural_questions","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_natural_questions","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_finetuned_natural_questions|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/datarpit/distilbert-base-uncased-finetuned-natural-questions
---
layout: model
title: Legal Non competition Clause Binary Classifier
author: John Snow Labs
name: legclf_non_competition_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `non-competition` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a True/False value for each of the legal clause models you have added.
## Predicted Entities
`other`, `non-competition`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_non_competition_clause_en_1.0.0_3.2_1660122728584.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_non_competition_clause_en_1.0.0_3.2_1660122728584.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
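The usage snippet is missing from this card; below is a minimal sketch following the pattern of other `legclf_*` cards. It assumes the standard sentence-embeddings + ClassifierDL stack; the embedding model name and import style are illustrative, not taken from the original card.

```python
# Sketch only (not from the original card): assumes Spark NLP for Legal is
# installed and licensed, with the usual imports available, e.g.
# from johnsnowlabs import nlp, legal
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Embedding model name is illustrative; check the card's companion notebook.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_non_competition_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR LEGAL TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```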
## Results
```bash
+-----------------+
|           result|
+-----------------+
|[non-competition]|
|          [other]|
|          [other]|
|[non-competition]|
+-----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_non_competition_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
non-competition 1.00 0.89 0.94 18
other 0.97 1.00 0.99 74
accuracy - - 0.98 92
macro-avg 0.99 0.94 0.96 92
weighted-avg 0.98 0.98 0.98 92
```
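As a quick sanity check, the macro and weighted averages in the table follow directly from the per-class rows; a small plain-Python sketch (numbers copied from the benchmarking table above):

```python
# Per-class scores copied from the benchmarking table above.
per_class = {
    "non-competition": {"f1": 0.94, "support": 18},
    "other":           {"f1": 0.99, "support": 74},
}
total = sum(c["support"] for c in per_class.values())  # 92 examples

# Macro average: unweighted mean over classes.
macro_f1 = sum(c["f1"] for c in per_class.values()) / len(per_class)

# Weighted average: mean weighted by each class's support.
weighted_f1 = sum(c["f1"] * c["support"] for c in per_class.values()) / total

print(total, round(macro_f1, 2), round(weighted_f1, 2))
```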
---
layout: model
title: English T5ForConditionalGeneration Cased model (from dbernsohn)
author: John Snow Labs
name: t5_numbers_gcd
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5_numbers_gcd` is an English model originally trained by `dbernsohn`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_numbers_gcd_en_4.3.0_3.0_1675156829112.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_numbers_gcd_en_4.3.0_3.0_1675156829112.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_numbers_gcd","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_numbers_gcd","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_numbers_gcd|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|283.1 MB|
## References
- https://huggingface.co/dbernsohn/t5_numbers_gcd
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://www.tensorflow.org/datasets/catalog/math_dataset#mathdatasetnumbers_gcd
- https://github.com/DorBernsohn/CodeLM/tree/main/MathLM
- https://www.linkedin.com/in/dor-bernsohn-70b2b1146/
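Per the math_dataset link above, `t5_numbers_gcd` is trained on the `numbers__gcd` split: greatest-common-divisor questions with the numeric answer as target text. A plain-Python sketch of the underlying task the model learns (question phrasing is illustrative):

```python
import math

def gcd_answer(a: int, b: int) -> str:
    # Target string for a question like
    # "What is the greatest common divisor of a and b?"
    return str(math.gcd(a, b))

print(gcd_answer(48, 18))  # Euclid: gcd(48, 18) = gcd(18, 12) = gcd(12, 6) = 6
```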
---
layout: model
title: Pipeline to Detect Clinical Entities (ner_eu_clinical_case - fr)
author: John Snow Labs
name: ner_eu_clinical_case_pipeline
date: 2023-03-08
tags: [fr, clinical, licensed, ner]
task: Named Entity Recognition
language: fr
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_eu_clinical_case](https://nlp.johnsnowlabs.com/2023/02/01/ner_eu_clinical_case_fr.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_pipeline_fr_4.3.0_3.2_1678261744783.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_pipeline_fr_4.3.0_3.2_1678261744783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_eu_clinical_case_pipeline", "fr", "clinical/models")
text = """
Un garçon de 3 ans atteint d'un trouble autistique à l'hôpital du service pédiatrique A de l'hôpital universitaire. Il n'a pas d'antécédents familiaux de troubles ou de maladies du spectre autistique. Le garçon a été diagnostiqué avec un trouble de communication sévère, avec des difficultés d'interaction sociale et un traitement sensoriel retardé. Les tests sanguins étaient normaux (thyréostimuline (TSH), hémoglobine, volume globulaire moyen (MCV) et ferritine). L'endoscopie haute a également montré une tumeur sous-muqueuse provoquant une obstruction subtotale de la sortie gastrique. Devant la suspicion d'une tumeur stromale gastro-intestinale, une gastrectomie distale a été réalisée. L'examen histopathologique a révélé une prolifération de cellules fusiformes dans la couche sous-muqueuse.
"""
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_eu_clinical_case_pipeline", "fr", "clinical/models")
val text = """
Un garçon de 3 ans atteint d'un trouble autistique à l'hôpital du service pédiatrique A de l'hôpital universitaire. Il n'a pas d'antécédents familiaux de troubles ou de maladies du spectre autistique. Le garçon a été diagnostiqué avec un trouble de communication sévère, avec des difficultés d'interaction sociale et un traitement sensoriel retardé. Les tests sanguins étaient normaux (thyréostimuline (TSH), hémoglobine, volume globulaire moyen (MCV) et ferritine). L'endoscopie haute a également montré une tumeur sous-muqueuse provoquant une obstruction subtotale de la sortie gastrique. Devant la suspicion d'une tumeur stromale gastro-intestinale, une gastrectomie distale a été réalisée. L'examen histopathologique a révélé une prolifération de cellules fusiformes dans la couche sous-muqueuse.
"""
val result = pipeline.fullAnnotate(text)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ud_bhtb", "bh") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['ओहु लोग के मालूम बा कि श्लील होखते भोजपुरी के नींव हिल जाई ।']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_bhtb", "bh")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("ओहु लोग के मालूम बा कि श्लील होखते भोजपुरी के नींव हिल जाई ।").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["ओहु लोग के मालूम बा कि श्लील होखते भोजपुरी के नींव हिल जाई ।"]
pos_df = nlu.load('bh.pos').predict(text)
pos_df
```
## Results
```bash
+------------------------------------------------------------+----------------------------------------------------------------------------------+
|text |result |
+------------------------------------------------------------+----------------------------------------------------------------------------------+
|ओहु लोग के मालूम बा कि श्लील होखते भोजपुरी के नींव हिल जाई ।|[DET, NOUN, ADP, NOUN, VERB, SCONJ, ADJ, VERB, PROPN, ADP, NOUN, VERB, AUX, PUNCT]|
+------------------------------------------------------------+----------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_bhtb|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[pos]|
|Language:|bh|
## Data Source
The model was trained on the [Universal Dependencies](http://universaldependencies.org) version 2.7.
Reference:
- Ojha, A. K., & Zeman, D. (2020). Universal Dependency Treebanks for Low-Resource Indian Languages: The Case of Bhojpuri. Proceedings of WILDRE5, the 5th Workshop on Indian Language Data: Resources and Evaluation.
## Benchmarking
```bash
| pos | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| ADJ | 0.92 | 0.92 | 0.92 | 250 |
| ADP | 0.95 | 0.96 | 0.96 | 989 |
| ADV | 0.85 | 0.88 | 0.86 | 32 |
| AUX | 0.93 | 0.95 | 0.94 | 355 |
| CCONJ | 0.95 | 0.95 | 0.95 | 151 |
| DET | 0.96 | 0.95 | 0.95 | 353 |
| INTJ | 1.00 | 1.00 | 1.00 | 5 |
| NOUN | 0.95 | 0.96 | 0.96 | 1854 |
| NUM | 0.97 | 0.98 | 0.97 | 149 |
| PART | 0.94 | 0.93 | 0.93 | 192 |
| PRON | 0.95 | 0.94 | 0.95 | 335 |
| PROPN | 0.94 | 0.94 | 0.94 | 419 |
| PUNCT | 0.97 | 0.96 | 0.96 | 695 |
| SCONJ | 1.00 | 0.96 | 0.98 | 118 |
| VERB | 0.95 | 0.93 | 0.94 | 767 |
| X | 0.50 | 1.00 | 0.67 | 1 |
| accuracy | | | 0.95 | 6665 |
| macro avg | 0.92 | 0.95 | 0.93 | 6665 |
| weighted avg | 0.95 | 0.95 | 0.95 | 6665 |
```
---
layout: model
title: Pipeline to Detect Clinical Entities (BertForTokenClassifier)
author: John Snow Labs
name: bert_token_classifier_ner_jsl_pipeline
date: 2022-03-23
tags: [licensed, ner, clinical, bertfortokenclassification, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [bert_token_classifier_ner_jsl](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_jsl_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_pipeline_en_3.4.1_3.0_1648044551434.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_pipeline_en_3.4.1_3.0_1648044551434.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_ner_jsl_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_ner_jsl_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.bert_token_ner_jsl.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
## Results
```bash
+--------------------------------+------------+
|chunk |ner_label |
+--------------------------------+------------+
|21-day-old |Age |
|Caucasian male |Demographics|
|congestion |Symptom |
|mom |Demographics|
|yellow discharge |Symptom |
|nares |Body_Part |
|she |Demographics|
|mild problems with his breathing|Symptom |
|perioral cyanosis |Symptom |
|retractions |Symptom |
|One day ago |Date_Time |
|mom |Demographics|
|tactile temperature |Symptom |
|Tylenol |Drug |
|Baby-girl |Age |
|decreased p.o. intake |Symptom |
|His |Demographics|
|breast-feeding |Body_Part |
|his |Demographics|
|respiratory congestion |Symptom |
+--------------------------------+------------+
only showing top 20 rows
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_jsl_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.5 MB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverter
---
layout: model
title: English BertForTokenClassification Small Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC4_original_PubmedBert_small
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4-original-PubmedBert_small` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities
`Chemical`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_original_PubmedBert_small_en_4.0.0_3.0_1657108249459.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_original_PubmedBert_small_en_4.0.0_3.0_1657108249459.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_original_PubmedBert_small","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_original_PubmedBert_small","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_BC4_original_PubmedBert_small|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|408.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ghadeermobasher/BC4-original-PubmedBert_small
---
layout: model
title: SNOMED ChunkResolver
author: John Snow Labs
name: chunkresolve_snomed_findings_clinical
class: ChunkEntityResolverModel
language: en
nav_key: models
repository: clinical/models
date: 2020-06-20
task: Entity Resolution
edition: Healthcare NLP 2.5.1
spark_version: 2.4
tags: [clinical,licensed,entity_resolution,en]
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Entity resolution model based on k-nearest-neighbour search over word embeddings, using Word Mover's Distance.
## Predicted Entities
SNOMED codes and their normalized definitions, resolved with `clinical_embeddings`.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_snomed_findings_clinical_en_2.5.1_2.4_1592617161564.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_snomed_findings_clinical_en_2.5.1_2.4_1592617161564.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPython.html %}
```python
...
snomed_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models")\
.setInputCols("token","chunk_embeddings")\
.setOutputCol("snomed_resolution")
pipeline_snomed = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, snomed_ner_converter, chunk_embeddings, snomed_resolver])
data = ["""Pentamidine 300 mg IV q . 36 hours , Pentamidine nasal wash 60 mg per 6 ml of sterile water q.d . , voriconazole 200 mg p.o . b.i.d . , acyclovir 400 mg p.o . b.i.d . , cyclosporine 50 mg p.o . b.i.d . , prednisone 60 mg p.o . q.d . , GCSF 480 mcg IV q.d . , Epogen 40,000 units subcu q . week , Protonix 40 mg q.d . , Simethicone 80 mg p.o . q . 8 , nitroglycerin paste 1 " ; q . 4 h . p.r.n . , flunisolide nasal inhaler , 2 puffs q . 8 , OxyCodone 10-15 mg p.o . q . 6 p.r.n . , Sudafed 30 mg q . 6 p.o . p.r.n . , Fluconazole 2% cream b.i.d . to erythematous skin lesions , Ditropan 5 mg p.o . b.i.d . , Tylenol 650 mg p.o . q . 4 h . p.r.n . , Ambien 5-10 mg p.o . q . h.s . p.r.n . , Neurontin 100 mg q . a.m . , 200 mg q . p.m . , Aquaphor cream b.i.d . p.r.n . , Lotrimin 1% cream b.i.d . to feet , Dulcolax 5-10 mg p.o . q.d . p.r.n . , Phoslo 667 mg p.o . t.i.d . , Peridex 0.12% , 15 ml p.o . b.i.d . mouthwash , Benadryl 25-50 mg q . 4-6 h . p.r.n . pruritus , Sarna cream q.d . p.r.n . pruritus , Nystatin 5 ml p.o . q.i.d . swish and !""",
"""Albuterol nebulizers 2.5 mg q.4h . and Atrovent nebulizers 0.5 mg q.4h . , please alternate albuterol and Atrovent ; Rocaltrol 0.25 mcg per NG tube q.d .; calcium carbonate 1250 mg per NG tube q.i.d .; vitamin B12 1000 mcg IM q . month , next dose is due Nov 18 ; diltiazem 60 mg per NG tube t.i.d .; ferrous sulfate 300 mg per NG t.i.d .; Haldol 5 mg IV q.h.s .; hydralazine 10 mg IV q.6h . p.r.n . hypertension ; lisinopril 10 mg per NG tube q.d .; Ativan 1 mg per NG tube q.h.s .; Lopressor 25 mg per NG tube t.i.d .; Zantac 150 mg per NG tube b.i.d .; multivitamin 10 ml per NG tube q.d .; Macrodantin 100 mg per NG tube q.i.d . x 10 days beginning on 11/3/00 .""",
"""Tylenol 650 mg p.o . q . 4-6h p.r.n . headache or pain ; acyclovir 400 mg p.o . t.i.d .; acyclovir topical t.i.d . to be applied to lesion on corner of mouth ; Peridex 15 ml p.o . b.i.d .; Mycelex 1 troche p.o . t.i.d .; g-csf 404 mcg subcu q.d .; folic acid 1 mg p.o . q.d .; lorazepam 1-2 mg p.o . q . 4-6h p.r.n . nausea and vomiting ; Miracle Cream topical q.d . p.r.n . perianal irritation ; Eucerin Cream topical b.i.d .; Zantac 150 mg p.o . b.i.d .; Restoril 15-30 mg p.o . q . h.s . p.r.n . insomnia ; multivitamin 1 tablet p.o . q.d .; viscous lidocaine 15 ml p.o . q . 3h can be applied to corner of mouth or lips p.r.n . pain control ."""]
model = pipeline_snomed.fit(spark.createDataFrame([['']]).toDF("text"))
results = model.transform(spark.createDataFrame([[t] for t in data]).toDF("text"))
```
```scala
...
val snomed_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models")
.setInputCols("token","chunk_embeddings")
.setOutputCol("snomed_resolution")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, snomed_ner_converter, chunk_embeddings, snomed_resolver))
val data = Array("""Pentamidine 300 mg IV q . 36 hours , Pentamidine nasal wash 60 mg per 6 ml of sterile water q.d . , voriconazole 200 mg p.o . b.i.d . , acyclovir 400 mg p.o . b.i.d . , cyclosporine 50 mg p.o . b.i.d . , prednisone 60 mg p.o . q.d . , GCSF 480 mcg IV q.d . , Epogen 40,000 units subcu q . week , Protonix 40 mg q.d . , Simethicone 80 mg p.o . q . 8 , nitroglycerin paste 1 " ; q . 4 h . p.r.n . , flunisolide nasal inhaler , 2 puffs q . 8 , OxyCodone 10-15 mg p.o . q . 6 p.r.n . , Sudafed 30 mg q . 6 p.o . p.r.n . , Fluconazole 2% cream b.i.d . to erythematous skin lesions , Ditropan 5 mg p.o . b.i.d . , Tylenol 650 mg p.o . q . 4 h . p.r.n . , Ambien 5-10 mg p.o . q . h.s . p.r.n . , Neurontin 100 mg q . a.m . , 200 mg q . p.m . , Aquaphor cream b.i.d . p.r.n . , Lotrimin 1% cream b.i.d . to feet , Dulcolax 5-10 mg p.o . q.d . p.r.n . , Phoslo 667 mg p.o . t.i.d . , Peridex 0.12% , 15 ml p.o . b.i.d . mouthwash , Benadryl 25-50 mg q . 4-6 h . p.r.n . pruritus , Sarna cream q.d . p.r.n . pruritus , Nystatin 5 ml p.o . q.i.d . swish and !""",
"""Albuterol nebulizers 2.5 mg q.4h . and Atrovent nebulizers 0.5 mg q.4h . , please alternate albuterol and Atrovent ; Rocaltrol 0.25 mcg per NG tube q.d .; calcium carbonate 1250 mg per NG tube q.i.d .; vitamin B12 1000 mcg IM q . month , next dose is due Nov 18 ; diltiazem 60 mg per NG tube t.i.d .; ferrous sulfate 300 mg per NG t.i.d .; Haldol 5 mg IV q.h.s .; hydralazine 10 mg IV q.6h . p.r.n . hypertension ; lisinopril 10 mg per NG tube q.d .; Ativan 1 mg per NG tube q.h.s .; Lopressor 25 mg per NG tube t.i.d .; Zantac 150 mg per NG tube b.i.d .; multivitamin 10 ml per NG tube q.d .; Macrodantin 100 mg per NG tube q.i.d . x 10 days beginning on 11/3/00 .""",
"""Tylenol 650 mg p.o . q . 4-6h p.r.n . headache or pain ; acyclovir 400 mg p.o . t.i.d .; acyclovir topical t.i.d . to be applied to lesion on corner of mouth ; Peridex 15 ml p.o . b.i.d .; Mycelex 1 troche p.o . t.i.d .; g-csf 404 mcg subcu q.d .; folic acid 1 mg p.o . q.d .; lorazepam 1-2 mg p.o . q . 4-6h p.r.n . nausea and vomiting ; Miracle Cream topical q.d . p.r.n . perianal irritation ; Eucerin Cream topical b.i.d .; Zantac 150 mg p.o . b.i.d .; Restoril 15-30 mg p.o . q . h.s . p.r.n . insomnia ; multivitamin 1 tablet p.o . q.d .; viscous lidocaine 15 ml p.o . q . 3h can be applied to corner of mouth or lips p.r.n . pain control .""")
val result = pipeline.fit(Seq("").toDF("text")).transform(data.toSeq.toDF("text"))
```
{:.h2_title}
## Results
```bash
+-----------------------------------------------------------------------------+-------+----------------------------------------------------------------------------------------------------+-----------------+----------+
| chunk| entity| target_text| code|confidence|
+-----------------------------------------------------------------------------+-------+----------------------------------------------------------------------------------------------------+-----------------+----------+
| erythematous skin lesions|PROBLEM|Skin lesion:::Achromic skin lesions of pinta:::Scaly skin:::Skin constricture:::Cratered skin les...| 95324001| 0.0937|
| pruritus|PROBLEM|Pruritus:::Genital pruritus:::Postmenopausal pruritus:::Pruritus hiemalis:::Pruritus ani:::Anogen...| 418363000| 0.1394|
| pruritus|PROBLEM|Pruritus:::Genital pruritus:::Postmenopausal pruritus:::Pruritus hiemalis:::Pruritus ani:::Anogen...| 418363000| 0.1394|
| hypertension|PROBLEM|Hypertension:::Renovascular hypertension:::Idiopathic hypertension:::Venous hypertension:::Resist...| 38341003| 0.1019|
| headache or pain|PROBLEM|Pain:::Headache:::Postchordotomy pain:::Throbbing pain:::Aching headache:::Postspinal headache:::...| 22253000| 0.0953|
| applied to lesion on corner of mouth|PROBLEM|Lesion of tongue:::Erythroleukoplakia of mouth:::Lesion of nose:::Lesion of oropharynx:::Erythrop...| 300246005| 0.0547|
| nausea and vomiting|PROBLEM|Nausea and vomiting:::Vomiting without nausea:::Nausea:::Intractable nausea and vomiting:::Vomiti...| 16932000| 0.0995|
| perianal irritation|PROBLEM|Perineal irritation:::Vulval irritation:::Skin irritation:::Perianal pain:::Perianal itch:::Vagin...| 281639001| 0.0764|
| insomnia|PROBLEM|Insomnia:::Mood insomnia:::Nonorganic insomnia:::Persistent insomnia:::Psychophysiologic insomnia...| 193462001| 0.1198|
+-----------------------------------------------------------------------------+-------+----------------------------------------------------------------------------------------------------+-----------------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Name:|chunkresolve_snomed_findings_clinical|
|Type:|ChunkEntityResolverModel|
|Compatibility:|Spark NLP 2.5.1+|
|License:|Licensed|
|Edition:|Official|
|Input labels:|[token, chunk_embeddings]|
|Output labels:|[entity]|
|Language:|en|
|Case sensitive:|True|
|Dependencies:|embeddings_clinical|
{:.h2_title}
## Data Source
Trained on SNOMED CT Findings
http://www.snomed.org/
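The resolution strategy described above (nearest-neighbour search over embeddings) can be illustrated with a toy example. The vectors, codes, and Euclidean metric here are simplified stand-ins; the real model uses clinical word embeddings and Word Mover's Distance:

```python
import math

# Toy "code dictionary": SNOMED-style codes with made-up 3-d embeddings.
code_vectors = {
    "418363000": [0.9, 0.1, 0.0],   # Pruritus
    "38341003":  [0.1, 0.9, 0.0],   # Hypertension
    "193462001": [0.0, 0.1, 0.9],   # Insomnia
}

def resolve(chunk_vec, k=1):
    """Return the k codes whose embeddings lie nearest to the chunk embedding."""
    ranked = sorted(code_vectors,
                    key=lambda c: math.dist(chunk_vec, code_vectors[c]))
    return ranked[:k]

print(resolve([0.8, 0.2, 0.1]))  # nearest toy code for a "pruritus"-like chunk
```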
---
layout: model
title: English ElectraForQuestionAnswering model (from navteca)
author: John Snow Labs
name: electra_qa_base_squad2
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-squad2` is an English model originally trained by `navteca`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_squad2_en_4.0.0_3.0_1655920731292.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_squad2_en_4.0.0_3.0_1655920731292.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.electra.base.by_navteca").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
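The question-answering annotator returns the answer text extracted from the context. As a plain-Python illustration (no Spark required), the character span of a predicted answer can be recovered from the context string; `find_answer_span` below is a hypothetical helper for demonstration, not part of Spark NLP.

```python
# Locate a predicted answer inside its context string and return
# (begin, end) character offsets, end-inclusive as in Spark NLP annotations.
def find_answer_span(context: str, answer: str):
    begin = context.find(answer)
    if begin == -1:
        return None
    return begin, begin + len(answer) - 1

context = "My name is Clara and I live in Berkeley."
span = find_answer_span(context, "Clara")
print(span)  # -> (11, 15)
```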
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_base_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|408.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/navteca/electra-base-squad2
- https://rajpurkar.github.io/SQuAD-explorer/
---
layout: model
title: Detect Genes and Human Phenotypes
author: John Snow Labs
name: ner_human_phenotype_gene_clinical
date: 2020-09-21
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 2.6.0
spark_version: 2.4
tags: [ner, en, licensed, clinical]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model detects mentions of genes and human phenotypes (HP) in medical text.
## Predicted Entities
`GENE`, `HP`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_HUMAN_PHENOTYPE_GENE_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_HUMAN_PHENOTYPE_GENE_CLINICAL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_clinical_en_2.5.5_2.4_1598558253840.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_clinical_en_2.5.5_2.4_1598558253840.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add a NerConverter at the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate("Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).")
```
```scala
...
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = NerDLModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models")
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))
val data = Seq("Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.human_phenotype.gene_clinical").predict("""Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).""")
```
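The `fullAnnotate` call returns chunk annotations carrying the matched text, character offsets, and entity label. A minimal plain-Python sketch of turning such annotations into the rows shown in the Results table follows; the dict layout here is an assumption for illustration, not the exact Spark NLP result schema.

```python
# Convert chunk annotations (result text, begin/end offsets, entity label)
# into (chunk, begin, end, entity) rows like the Results table below.
annotations = [
    {"result": "polyhydramnios", "begin": 75, "end": 88,
     "metadata": {"entity": "HP"}},
    {"result": "polyuria", "begin": 91, "end": 98,
     "metadata": {"entity": "HP"}},
]

rows = [(a["result"], a["begin"], a["end"], a["metadata"]["entity"])
        for a in annotations]
for chunk, begin, end, entity in rows:
    print(f"{chunk:<16} {begin:>5} {end:>5} {entity}")
```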
{:.h2_title}
## Results
```bash
+----+------------------+---------+-------+----------+
| | chunk | begin | end | entity |
+====+==================+=========+=======+==========+
| 0 | BS type | 29 | 32 | GENE |
+----+------------------+---------+-------+----------+
| 1 | polyhydramnios | 75 | 88 | HP |
+----+------------------+---------+-------+----------+
| 2 | polyuria | 91 | 98 | HP |
+----+------------------+---------+-------+----------+
| 3 | nephrocalcinosis | 101 | 116 | HP |
+----+------------------+---------+-------+----------+
| 4 | hypokalemia | 122 | 132 | HP |
+----+------------------+---------+-------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_human_phenotype_gene_clinical|
|Type:|ner|
|Compatibility:|Healthcare NLP 2.6.0 +|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|[en]|
|Case sensitive:|false|
## Data source
This model was trained with data from https://github.com/lasigeBioTM/PGR
For further details please refer to https://aclweb.org/anthology/papers/N/N19/N19-1152/
## Benchmarking
```bash
| | label | tp | fp | fn | prec | rec | f1 |
|---:|--------------:|------:|-----:|-----:|---------:|---------:|---------:|
| 0 | I-HP | 303 | 56 | 64 | 0.844011 | 0.825613 | 0.834711 |
| 1 | B-GENE | 1176 | 158 | 252 | 0.881559 | 0.823529 | 0.851557 |
| 2 | B-HP | 1078 | 133 | 96 | 0.890173 | 0.918228 | 0.903983 |
| 3 | Macro-average | 2557 | 347 | 412 | 0.871915 | 0.85579 | 0.863777 |
| 4 | Micro-average | 2557 | 347 | 412 | 0.88051 | 0.861233 | 0.870765 |
```
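The averaged rows in the benchmarking table follow directly from the summed per-label counts. A small sanity-check sketch, reproducing the micro-averaged precision, recall, and F1 from the tp/fp/fn totals above:

```python
# Reproduce the micro-averaged precision/recall/F1 from summed counts.
def prf(tp: int, fp: int, fn: int):
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return prec, rec, f1

prec, rec, f1 = prf(tp=2557, fp=347, fn=412)
print(round(prec, 6), round(rec, 6), round(f1, 6))  # ~ 0.88051 0.861233 0.870765
```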
---
layout: model
title: Fast Neural Machine Translation Model from Bulgarian to English
author: John Snow Labs
name: opus_mt_bg_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, bg, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `bg`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bg_en_xx_2.7.0_2.4_1609169845170.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bg_en_xx_2.7.0_2.4_1609169845170.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_bg_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_bg_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.bg.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_bg_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: BERT Sequence Classification - Detecting Hate Speech (bert_sequence_classifier_hatexplain)
author: John Snow Labs
name: bert_sequence_classifier_hatexplain
date: 2021-11-06
tags: [bert_for_sequence_classification, hate, hate_speech, speech, offensive, en, open_source]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 3.3.2
spark_version: 2.4
supported: true
annotator: BertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is imported from Hugging Face and is used for classifying a text as `Hate speech`, `Offensive`, or `Normal`. The model was trained on data from Gab and Twitter, and human rationales were included as part of the training data to boost performance.
- Citation:
```bash
@article{mathew2020hatexplain,
title={HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection},
author={Mathew, Binny and Saha, Punyajoy and Yimam, Seid Muhie and Biemann, Chris and Goyal, Pawan and Mukherjee, Animesh},
journal={arXiv preprint arXiv:2012.10289},
year={2020}
}
```
## Predicted Entities
`hate speech`, `normal`, `offensive`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_hatexplain_en_3.3.2_2.4_1636214446271.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_hatexplain_en_3.3.2_2.4_1636214446271.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
sequenceClassifier = BertForSequenceClassification \
.pretrained('bert_sequence_classifier_hatexplain', 'en') \
.setInputCols(['token', 'document']) \
.setOutputCol('class') \
.setCaseSensitive(True) \
.setMaxSentenceLength(512)
pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])
example = spark.createDataFrame([['I love you very much!']]).toDF("text")
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_hatexplain", "en")
.setInputCols("document", "token")
.setOutputCol("class")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))
val example = Seq("I love you very much!").toDF("text")
val result = pipeline.fit(example).transform(example)
```
## Results
```bash
['normal']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_hatexplain|
|Compatibility:|Spark NLP 3.3.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[label]|
|Language:|en|
|Case sensitive:|true|
## Data Source
[https://huggingface.co/Hate-speech-CNERG/bert-base-uncased-hatexplain](https://huggingface.co/Hate-speech-CNERG/bert-base-uncased-hatexplain)
## Benchmarking
```bash
+-------+------------+--------+
| Acc | Macro F1 | AUROC |
+-------+------------+--------+
| 0.698 | 0.687 | 0.851 |
+-------+------------+--------+
```
---
layout: model
title: English Deberta Embeddings model (from domenicrosati)
author: John Snow Labs
name: deberta_embeddings_mlm_test
date: 2023-03-13
tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, en, tensorflow]
task: Embeddings
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DeBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deberta-mlm-test` is an English model originally trained by `domenicrosati`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_mlm_test_en_4.3.1_3.0_1678702297278.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_mlm_test_en_4.3.1_3.0_1678702297278.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_mlm_test","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_mlm_test","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
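Each row of the `embeddings` output column carries one vector per token, which downstream tasks typically compare by cosine similarity. A plain-Python sketch of that comparison is below; the short vectors are made up for demonstration (real DeBERTa base vectors are much longer).

```python
# Cosine similarity between two embedding vectors: dot product of the
# vectors divided by the product of their Euclidean norms.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 4))  # -> 1.0 (colinear)
print(round(cosine([1.0, 0.0], [0.0, 1.0]), 4))            # -> 0.0 (orthogonal)
```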
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|deberta_embeddings_mlm_test|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|267.4 MB|
|Case sensitive:|false|
## References
https://huggingface.co/domenicrosati/deberta-mlm-test
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_el8_dl4
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el8-dl4` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el8_dl4_en_4.3.0_3.0_1675120674389.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el8_dl4_en_4.3.0_3.0_1675120674389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_small_el8_dl4","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_small_el8_dl4","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_el8_dl4|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|143.8 MB|
## References
- https://huggingface.co/google/t5-efficient-small-el8-dl4
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: RoBERTa Large CoNLL-03 NER Pipeline
author: John Snow Labs
name: roberta_large_token_classifier_conll03_pipeline
date: 2022-06-19
tags: [open_source, ner, token_classifier, roberta, conll03, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [roberta_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/roberta_large_token_classifier_conll03_en.html) model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655654476076.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655654476076.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("roberta_large_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")
```
```scala
val pipeline = new PretrainedPipeline("roberta_large_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|John |PERSON |
|John Snow Labs|ORG |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_large_token_classifier_conll03_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.3 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- RoBertaForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: Detect PHI for Deidentification (Augmented)
author: John Snow Labs
name: ner_deid_augmented
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Deidentification NER (Augmented) is a Named Entity Recognition model that annotates text to find protected health information that may need to be deidentified.
We adhered to the official annotation guidelines (AG) of the 2014 i2b2 Deid challenge while annotating new datasets for this model. All the details regarding the nuances of and explanations for the AG can be found here: [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/)
## Predicted Entities
`AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_augmented_en_3.0.0_3.0_1617208449273.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_augmented_en_3.0.0_3.0_1617208449273.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_deid_augmented","en","clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([['HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. ']], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_deid_augmented","en","clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
embeddings_clinical,
ner,
ner_converter))
val data = Seq("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. """).toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.deid.augmented").predict("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. """)
```
## Results
```bash
+---------------+---------+
|chunk |ner_label|
+---------------+---------+
|Smith |NAME |
|VA Hospital |LOCATION |
|John Green |NAME |
|2347165768 |ID |
|Day Hospital |LOCATION |
|02/04/2003 |DATE |
|Smith |NAME |
|Day Hospital |LOCATION |
|Smith |NAME |
|Smith |NAME |
|7 Ardmore Tower|LOCATION |
|Hart |NAME |
|Smith |NAME |
|02/07/2003 |DATE |
+---------------+---------+
```
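The detected chunks above carry begin/end character offsets, which is what a downstream de-identification step uses to mask PHI in the original note. A minimal plain-Python sketch, with offsets made up for the demo:

```python
# Mask PHI chunks in a text using (begin, end, label) triples,
# with end-inclusive offsets as in Spark NLP annotations.
def mask_phi(text, chunks):
    out, cursor = [], 0
    for begin, end, label in sorted(chunks):
        out.append(text[cursor:begin])
        out.append(f"<{label}>")
        cursor = end + 1
    out.append(text[cursor:])
    return "".join(out)

text = "Mr. Smith was seen on 02/04/2003."
chunks = [(4, 8, "NAME"), (22, 31, "DATE")]
print(mask_phi(text, chunks))  # -> Mr. <NAME> was seen on <DATE>.
```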
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_augmented|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Data Source
Trained on plain n2c2 2014: De-identification and Heart Disease Risk Factors Challenge datasets with embeddings_clinical https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/
## Benchmarking
```bash
| | label | tp | fp | fn | prec | rec | f1 |
|---:|--------------:|------:|------:|------:|---------:|---------:|---------:|
| 0 | I-NAME | 1096 | 47 | 80 | 0.95888 | 0.931973 | 0.945235 |
| 1 | I-CONTACT | 93 | 0 | 4 | 1 | 0.958763 | 0.978947 |
| 2 | I-AGE | 3 | 1 | 6 | 0.75 | 0.333333 | 0.461538 |
| 3 | B-DATE | 2078 | 42 | 52 | 0.980189 | 0.975587 | 0.977882 |
| 4 | I-DATE | 474 | 39 | 25 | 0.923977 | 0.9499 | 0.936759 |
| 5 | I-LOCATION | 755 | 68 | 76 | 0.917375 | 0.908544 | 0.912938 |
| 6 | I-PROFESSION | 78 | 8 | 9 | 0.906977 | 0.896552 | 0.901734 |
| 7 | B-NAME | 1182 | 101 | 36 | 0.921278 | 0.970443 | 0.945222 |
| 8 | B-AGE | 259 | 10 | 11 | 0.962825 | 0.959259 | 0.961039 |
| 9 | B-ID | 146 | 8 | 11 | 0.948052 | 0.929936 | 0.938907 |
| 10 | B-PROFESSION | 76 | 9 | 21 | 0.894118 | 0.783505 | 0.835165 |
| 11 | B-LOCATION | 556 | 87 | 71 | 0.864697 | 0.886762 | 0.875591 |
| 12 | I-ID | 64 | 8 | 3 | 0.888889 | 0.955224 | 0.920863 |
| 13 | B-CONTACT | 40 | 7 | 5 | 0.851064 | 0.888889 | 0.869565 |
| 14 | Macro-average | 6900 | 435 | 410 | 0.912023 | 0.880619 | 0.896046 |
| 15 | Micro-average | 6900 | 435 | 410 | 0.940695 | 0.943912 | 0.942301 |
```
---
layout: model
title: Fast Neural Machine Translation Model from English to Haitian Creole
author: John Snow Labs
name: opus_mt_en_ht
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, ht, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `ht`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ht_xx_2.7.0_2.4_1609163766018.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ht_xx_2.7.0_2.4_1609163766018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_ht", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_ht", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ht').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ht|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForQuestionAnswering model (from Kutay)
author: John Snow Labs
name: bert_qa_fine_tuned_tweetqa_aip
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fine_tuned_tweetqa_aip` is an English model originally trained by `Kutay`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_fine_tuned_tweetqa_aip_en_4.0.0_3.0_1654187707453.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_fine_tuned_tweetqa_aip_en_4.0.0_3.0_1654187707453.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_fine_tuned_tweetqa_aip","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_fine_tuned_tweetqa_aip","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.trivia.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
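The nlu one-liner above packs the question and the context into a single string joined by the `|||` separator, which nlu splits back into the two inputs; a minimal sketch of that convention:

```python
def split_qa(payload: str):
    """Split an nlu-style 'question|||context' payload into its two parts."""
    question, _, context = payload.partition("|||")
    return question, context

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # → What's my name?
print(c)  # → My name is Clara and I live in Berkeley.
```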
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_fine_tuned_tweetqa_aip|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Kutay/fine_tuned_tweetqa_aip
---
layout: model
title: Extract relations between drugs and proteins (ReDL)
author: John Snow Labs
name: redl_drugprot_biobert
date: 2023-01-14
tags: [relation_extraction, clinical, en, licensed, tensorflow]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Detect interactions between chemical compounds/drugs and genes/proteins using BERT, by classifying whether a specified semantic relation holds between chemical and gene entities within a sentence or document. The entity labels used during training were derived from the custom NER model created by our team for the DrugProt corpus: CHEMICAL for chemical compounds/drugs, GENE for genes/proteins, and GENE_AND_CHEMICAL for entity mentions of type GENE and of type CHEMICAL that overlap (such as enzymes and small peptides). Because certain categories had few examples, the 13 relation categories of the DrugProt corpus were condensed to 10: the SUBSTRATE_PRODUCT-OF and SUBSTRATE categories were grouped together, and the AGONIST-ACTIVATOR, AGONIST-INHIBITOR, and AGONIST categories were grouped together.
## Predicted Entities
`INHIBITOR`, `DIRECT-REGULATOR`, `SUBSTRATE`, `ACTIVATOR`, `INDIRECT-UPREGULATOR`, `INDIRECT-DOWNREGULATOR`, `ANTAGONIST`, `PRODUCT-OF`, `PART-OF`, `AGONIST`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_drugprot_biobert_en_4.2.4_3.0_1673736326031.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_drugprot_biobert_en_4.2.4_3.0_1673736326031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The table below lists the `redl_drugprot_biobert` RE model, its labels, the optimal NER model to pair it with, and the meaningful relation pairs.
| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:---------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------:|------------------------------------------------------------------------------------|
| redl_drugprot_biobert | INHIBITOR, DIRECT-REGULATOR, SUBSTRATE, ACTIVATOR, INDIRECT-UPREGULATOR, INDIRECT-DOWNREGULATOR, ANTAGONIST, PRODUCT-OF, PART-OF, AGONIST | ner_drugprot_clinical | ["chemical-gene", "chemical-gene_and_chemical", "gene_and_chemical-gene"] |
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
words_embedder = WordEmbeddingsModel()\
.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")
drugprot_ner_tagger = MedicalNerModel.pretrained("ner_drugprot_clinical", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_converter = NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ner_tags"])\
.setOutputCol("ner_chunks")
pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
dependency_parser = DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentences", "pos_tags", "tokens"])\
.setOutputCol("dependencies")
# Set a filter on pairs of named entities which will be treated as relation candidates
drugprot_re_ner_chunk_filter = RENerChunksFilter()\
.setInputCols(["ner_chunks", "dependencies"])\
.setOutputCol("re_ner_chunks")\
.setMaxSyntacticDistance(4)
# .setRelationPairs(['CHEMICAL-GENE'])
drugprot_re_Model = RelationExtractionDLModel()\
.pretrained('redl_drugprot_biobert', "en", "clinical/models")\
.setPredictionThreshold(0.9)\
.setInputCols(["re_ner_chunks", "sentences"])\
.setOutputCol("relations")
pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, words_embedder, drugprot_ner_tagger, ner_converter, pos_tagger, dependency_parser, drugprot_re_ner_chunk_filter, drugprot_re_Model])
text='''Lipid specific activation of the murine P4-ATPase Atp8a1 (ATPase II). The asymmetric transbilayer distribution of phosphatidylserine (PS) in the mammalian plasma membrane and secretory vesicles is maintained, in part, by an ATP-dependent transporter. This aminophospholipid "flippase" selectively transports PS to the cytosolic leaflet of the bilayer and is sensitive to vanadate, Ca(2+), and modification by sulfhydryl reagents. Although the flippase has not been positively identified, a subfamily of P-type ATPases has been proposed to function as transporters of amphipaths, including PS and other phospholipids. A candidate PS flippase ATP8A1 (ATPase II), originally isolated from bovine secretory vesicles, is a member of this subfamily based on sequence homology to the founding member of the subfamily, the yeast protein Drs2, which has been linked to ribosomal assembly, the formation of Golgi-coated vesicles, and the maintenance of PS asymmetry. To determine if ATP8A1 has biochemical characteristics consistent with a PS flippase, a murine homologue of this enzyme was expressed in insect cells and purified. The purified Atp8a1 is inactive in detergent micelles or in micelles containing phosphatidylcholine, phosphatidic acid, or phosphatidylinositol, is minimally activated by phosphatidylglycerol or phosphatidylethanolamine (PE), and is maximally activated by PS. The selectivity for PS is dependent upon multiple elements of the lipid structure. Similar to the plasma membrane PS transporter, Atp8a1 is activated only by the naturally occurring sn-1,2-glycerol isomer of PS and not the sn-2,3-glycerol stereoisomer. Both flippase and Atp8a1 activities are insensitive to the stereochemistry of the serine headgroup. Most modifications of the PS headgroup structure decrease recognition by the plasma membrane PS flippase. 
Activation of Atp8a1 is also reduced by these modifications; phosphatidylserine-O-methyl ester, lysophosphatidylserine, glycerophosphoserine, and phosphoserine, which are not transported by the plasma membrane flippase, do not activate Atp8a1. Weakly translocated lipids (PE, phosphatidylhydroxypropionate, and phosphatidylhomoserine) are also weak Atp8a1 activators. However, N-methyl-phosphatidylserine, which is transported by the plasma membrane flippase at a rate equivalent to PS, is incapable of activating Atp8a1 activity. These results indicate that the ATPase activity of the secretory granule Atp8a1 is activated by phospholipids binding to a specific site whose properties (PS selectivity, dependence upon glycerol but not serine, stereochemistry, and vanadate sensitivity) are similar to, but distinct from, the properties of the substrate binding site of the plasma membrane flippase.'''
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val words_embedder = WordEmbeddingsModel()
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val drugprot_ner_tagger = MedicalNerModel.pretrained("ner_drugprot_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val pos_tagger = PerceptronModel()
.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val dependency_parser = DependencyParserModel()
.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
// Set a filter on pairs of named entities which will be treated as relation candidates
val drugprot_re_ner_chunk_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunks", "dependencies"))
.setMaxSyntacticDistance(10)
.setOutputCol("re_ner_chunks")
// .setRelationPairs(Array("CHEMICAL-GENE"))
// This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input.
val drugprot_re_Model = RelationExtractionDLModel()
.pretrained("redl_drugprot_biobert", "en", "clinical/models")
.setPredictionThreshold(0.9)
.setInputCols(Array("re_ner_chunks", "sentences"))
.setOutputCol("relations")
val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, words_embedder, drugprot_ner_tagger, ner_converter, pos_tagger, dependency_parser, drugprot_re_ner_chunk_filter, drugprot_re_Model))
val data = Seq("""Lipid specific activation of the murine P4-ATPase Atp8a1 (ATPase II). The asymmetric transbilayer distribution of phosphatidylserine (PS) in the mammalian plasma membrane and secretory vesicles is maintained, in part, by an ATP-dependent transporter. This aminophospholipid "flippase" selectively transports PS to the cytosolic leaflet of the bilayer and is sensitive to vanadate, Ca(2+), and modification by sulfhydryl reagents. Although the flippase has not been positively identified, a subfamily of P-type ATPases has been proposed to function as transporters of amphipaths, including PS and other phospholipids. A candidate PS flippase ATP8A1 (ATPase II), originally isolated from bovine secretory vesicles, is a member of this subfamily based on sequence homology to the founding member of the subfamily, the yeast protein Drs2, which has been linked to ribosomal assembly, the formation of Golgi-coated vesicles, and the maintenance of PS asymmetry. To determine if ATP8A1 has biochemical characteristics consistent with a PS flippase, a murine homologue of this enzyme was expressed in insect cells and purified. The purified Atp8a1 is inactive in detergent micelles or in micelles containing phosphatidylcholine, phosphatidic acid, or phosphatidylinositol, is minimally activated by phosphatidylglycerol or phosphatidylethanolamine (PE), and is maximally activated by PS. The selectivity for PS is dependent upon multiple elements of the lipid structure. Similar to the plasma membrane PS transporter, Atp8a1 is activated only by the naturally occurring sn-1,2-glycerol isomer of PS and not the sn-2,3-glycerol stereoisomer. Both flippase and Atp8a1 activities are insensitive to the stereochemistry of the serine headgroup. Most modifications of the PS headgroup structure decrease recognition by the plasma membrane PS flippase. 
Activation of Atp8a1 is also reduced by these modifications; phosphatidylserine-O-methyl ester, lysophosphatidylserine, glycerophosphoserine, and phosphoserine, which are not transported by the plasma membrane flippase, do not activate Atp8a1. Weakly translocated lipids (PE, phosphatidylhydroxypropionate, and phosphatidylhomoserine) are also weak Atp8a1 activators. However, N-methyl-phosphatidylserine, which is transported by the plasma membrane flippase at a rate equivalent to PS, is incapable of activating Atp8a1 activity. These results indicate that the ATPase activity of the secretory granule Atp8a1 is activated by phospholipids binding to a specific site whose properties (PS selectivity, dependence upon glycerol but not serine, stereochemistry, and vanadate sensitivity) are similar to, but distinct from, the properties of the substrate binding site of the plasma membrane flippase.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.drugprot").predict("""Lipid specific activation of the murine P4-ATPase Atp8a1 (ATPase II). The asymmetric transbilayer distribution of phosphatidylserine (PS) in the mammalian plasma membrane and secretory vesicles is maintained, in part, by an ATP-dependent transporter. This aminophospholipid "flippase" selectively transports PS to the cytosolic leaflet of the bilayer and is sensitive to vanadate, Ca(2+), and modification by sulfhydryl reagents. Although the flippase has not been positively identified, a subfamily of P-type ATPases has been proposed to function as transporters of amphipaths, including PS and other phospholipids. A candidate PS flippase ATP8A1 (ATPase II), originally isolated from bovine secretory vesicles, is a member of this subfamily based on sequence homology to the founding member of the subfamily, the yeast protein Drs2, which has been linked to ribosomal assembly, the formation of Golgi-coated vesicles, and the maintenance of PS asymmetry. To determine if ATP8A1 has biochemical characteristics consistent with a PS flippase, a murine homologue of this enzyme was expressed in insect cells and purified. The purified Atp8a1 is inactive in detergent micelles or in micelles containing phosphatidylcholine, phosphatidic acid, or phosphatidylinositol, is minimally activated by phosphatidylglycerol or phosphatidylethanolamine (PE), and is maximally activated by PS. The selectivity for PS is dependent upon multiple elements of the lipid structure. Similar to the plasma membrane PS transporter, Atp8a1 is activated only by the naturally occurring sn-1,2-glycerol isomer of PS and not the sn-2,3-glycerol stereoisomer. Both flippase and Atp8a1 activities are insensitive to the stereochemistry of the serine headgroup. Most modifications of the PS headgroup structure decrease recognition by the plasma membrane PS flippase. 
Activation of Atp8a1 is also reduced by these modifications; phosphatidylserine-O-methyl ester, lysophosphatidylserine, glycerophosphoserine, and phosphoserine, which are not transported by the plasma membrane flippase, do not activate Atp8a1. Weakly translocated lipids (PE, phosphatidylhydroxypropionate, and phosphatidylhomoserine) are also weak Atp8a1 activators. However, N-methyl-phosphatidylserine, which is transported by the plasma membrane flippase at a rate equivalent to PS, is incapable of activating Atp8a1 activity. These results indicate that the ATPase activity of the secretory granule Atp8a1 is activated by phospholipids binding to a specific site whose properties (PS selectivity, dependence upon glycerol but not serine, stereochemistry, and vanadate sensitivity) are similar to, but distinct from, the properties of the substrate binding site of the plasma membrane flippase.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_wobert_chinese_plus","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_wobert_chinese_plus","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.embed.wobert_chinese_plus").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_wobert_chinese_plus|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|467.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/qinluo/wobert-chinese-plus
- https://github.com/ZhuiyiTechnology/WoBERT
- https://github.com/JunnYu/WoBERT_pytorch
---
layout: model
title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18)
author: John Snow Labs
name: roberta_qa_base_spanish_squades_becas1
date: 2023-01-20
tags: [es, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: es
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades-becas1` is a Spanish model originally trained by `Evelyn18`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becas1_es_4.3.0_3.0_1674217912605.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becas1_es_4.3.0_3.0_1674217912605.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becas1","es")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becas1","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_spanish_squades_becas1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|460.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Evelyn18/roberta-base-spanish-squades-becas1
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_urdu_proj TFWav2Vec2ForCTC from MSaudTahir
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_urdu_proj
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_urdu_proj` is an English model originally trained by MSaudTahir.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_urdu_proj_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_urdu_proj_en_4.2.0_3.0_1664102013221.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_urdu_proj_en_4.2.0_3.0_1664102013221.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_urdu_proj', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_urdu_proj", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_urdu_proj|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Legal Politics And Public Safety Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_politics_and_public_safety_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, politics_and_public_safety, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the `legclf_politics_and_public_safety_bert` model, a BERT Sentence Embeddings Document Classifier, predicts whether the document belongs to the class Politics_and_Public_Safety or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`Politics_and_Public_Safety`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_politics_and_public_safety_bert_en_1.0.0_3.0_1678111789915.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_politics_and_public_safety_bert_en_1.0.0_3.0_1678111789915.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
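This card is missing its usage snippet; the following is a minimal sketch in the style of the other cards in this collection, assuming the standard Legal NLP document-classification pipeline (the `sent_bert_base_cased` embeddings stage is an assumption based on the model's `sentence_embeddings` input label — verify it against the model's requirements before use):

```python
# Sketch only: assumes the standard BERT-sentence-embeddings classifier pipeline
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Assumed embeddings stage; must produce the "sentence_embeddings" input the classifier expects
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

doc_classifier = ClassifierDLModel.pretrained("legclf_politics_and_public_safety_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

data = spark.createDataFrame([["YOUR LEGAL TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```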
## Results
```bash
+----------------------------+
|result|
+----------------------------+
|[politics_and_public_safety]|
|[other]|
|[other]|
|[politics_and_public_safety]|
+----------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_politics_and_public_safety_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Legal documents, scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
other 0.97 0.98 0.98 65
politics_and_public_safety 0.97 0.94 0.95 33
accuracy - - 0.97 98
macro-avg 0.97 0.96 0.97 98
weighted-avg 0.97 0.97 0.97 98
```
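The macro and weighted averages above follow directly from the per-class rows and their supports; a quick sanity check using the values from the table (rows are kept as anonymous tuples of precision, recall, f1, support):

```python
# Per-class rows from the benchmarking table above: (precision, recall, f1, support)
rows = [
    (0.97, 0.98, 0.98, 65),
    (0.97, 0.94, 0.95, 33),
]
total_support = sum(r[3] for r in rows)  # 98

# Macro average: unweighted mean over classes
macro_recall = sum(r[1] for r in rows) / len(rows)

# Weighted average: mean weighted by class support
weighted_f1 = sum(r[2] * r[3] for r in rows) / total_support

print(round(macro_recall, 2), round(weighted_f1, 2))  # → 0.96 0.97
```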
---
layout: model
title: Fast Neural Machine Translation Model from English to Hiri Motu
author: John Snow Labs
name: opus_mt_en_ho
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, ho, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `ho`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ho_xx_2.7.0_2.4_1609169379532.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ho_xx_2.7.0_2.4_1609169379532.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_en_ho", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_ho", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ho').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ho|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: ALBERT Embeddings (XXLarge Uncased)
author: John Snow Labs
name: albert_xxlarge_uncased
date: 2020-04-28
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [embeddings, en, open_source]
supported: true
annotator: AlBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
ALBERT is "A Lite" version of BERT, a popular unsupervised language representation learning algorithm. ALBERT uses parameter-reduction techniques that allow for large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation. The details are described in the paper "[ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.](https://arxiv.org/abs/1909.11942)"
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_xxlarge_uncased_en_2.5.0_2.4_1588073588232.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_xxlarge_uncased_en_2.5.0_2.4_1588073588232.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = AlbertEmbeddings.pretrained("albert_xxlarge_uncased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = AlbertEmbeddings.pretrained("albert_xxlarge_uncased", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.albert.xxlarge_uncased').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_albert_xxlarge_uncased_embeddings
I [-0.07972775399684906, 0.06297606974840164, 0....
love [-0.07597140967845917, 0.05237535387277603, 0....
NLP [0.005398618057370186, -0.0253510233014822, 0....
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_xxlarge_uncased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.5.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|en|
|Dimension:|1024|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from [https://tfhub.dev/google/albert_xxlarge/3](https://tfhub.dev/google/albert_xxlarge/3)
---
layout: model
title: Explain Document Pipeline for Portuguese
author: John Snow Labs
name: explain_document_sm
date: 2021-03-22
tags: [open_source, portuguese, explain_document_sm, pipeline, pt]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: pt
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The explain_document_sm is a pretrained pipeline that processes text with a simple sequence of basic processing steps.
It performs most of the common text processing tasks on your dataframe.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_pt_3.0.0_3.0_1616422933551.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_sm_pt_3.0.0_3.0_1616422933551.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('explain_document_sm', lang = 'pt')
annotations = pipeline.fullAnnotate("Olá de John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_sm", lang = "pt")
val result = pipeline.fullAnnotate("Olá de John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Olá de John Snow Labs! "]
result_df = nlu.load('pt.explain').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | lemma | pos | embeddings | ner | entities |
|---:|:----------------------------|:---------------------------|:---------------------------------------|:---------------------------------------|:--------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
| 0 | ['Olá de John Snow Labs! '] | ['Olá de John Snow Labs!'] | ['Olá', 'de', 'John', 'Snow', 'Labs!'] | ['Olá', 'de', 'John', 'Snow', 'Labs!'] | ['PROPN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_document_sm|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|pt|
---
layout: model
title: Legal Land Transport Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_land_transport_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, land_transport, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
The `legclf_land_transport_bert` model is a Bert Sentence Embeddings Document Classifier which, given a document, classifies whether or not it belongs to the class `Land_Transport` (Binary Classification) according to EuroVoc labels.
## Predicted Entities
`Land_Transport`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_land_transport_bert_en_1.0.0_3.0_1678111683794.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_land_transport_bert_en_1.0.0_3.0_1678111683794.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
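A minimal usage sketch, assuming the `johnsnowlabs` Python library with an active Legal NLP license; the sentence-embeddings model name (`sent_bert_base_cased`) is an assumption, so check which embeddings this classifier was trained with before running it.
```python
# Minimal sketch of a Legal NLP document-classification pipeline.
# Assumptions: the `johnsnowlabs` library with a Legal NLP license,
# and "sent_bert_base_cased" as the sentence-embeddings stage.
from johnsnowlabs import nlp, legal
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
doc_classifier = legal.ClassifierDLModel.pretrained("legclf_land_transport_bert", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("class")
pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])
df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```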
## Results
```bash
+----------------+
|result          |
+----------------+
|[Land_Transport]|
|[Other]         |
|[Other]         |
|[Land_Transport]|
+----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_land_transport_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.5 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Land_Transport 0.87 0.92 0.89 97
Other 0.92 0.88 0.90 104
accuracy - - 0.90 201
macro-avg 0.90 0.90 0.90 201
weighted-avg 0.90 0.90 0.90 201
```
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab1_by_tahazakir TFWav2Vec2ForCTC from tahazakir
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab1_by_tahazakir
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab1_by_tahazakir` is an English model originally trained by tahazakir.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab1_by_tahazakir_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab1_by_tahazakir_en_4.2.0_3.0_1664038802857.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab1_by_tahazakir_en_4.2.0_3.0_1664038802857.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab1_by_tahazakir', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab1_by_tahazakir", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab1_by_tahazakir|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Legal Litigations Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_litigations_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, litigations, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Litigations` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Litigations`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_litigations_bert_en_1.0.0_3.0_1678049988540.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_litigations_bert_en_1.0.0_3.0_1678049988540.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
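A minimal usage sketch, assuming the `johnsnowlabs` Python library with an active Legal NLP license; the sentence-embeddings model name (`sent_bert_base_cased`) is an assumption, so check which embeddings this classifier expects before running it.
```python
# Minimal sketch of a Legal NLP clause-classification pipeline.
# Assumptions: the `johnsnowlabs` library with a Legal NLP license,
# and "sent_bert_base_cased" as the sentence-embeddings stage.
from johnsnowlabs import nlp, legal
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
doc_classifier = legal.ClassifierDLModel.pretrained("legclf_litigations_bert", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("class")
pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])
df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```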
## Results
```bash
+-------------+
|result       |
+-------------+
|[Litigations]|
|[Other]      |
|[Other]      |
|[Litigations]|
+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_litigations_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Litigations 0.99 0.97 0.98 125
Other 0.97 0.99 0.98 150
accuracy - - 0.98 275
macro-avg 0.98 0.98 0.98 275
weighted-avg 0.98 0.98 0.98 275
```
---
layout: model
title: Dutch BertForMaskedLM Base Cased model (from Geotrend)
author: John Snow Labs
name: bert_embeddings_base_nl_cased
date: 2022-12-02
tags: [nl, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: nl
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-nl-cased` is a Dutch model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_nl_cased_nl_4.2.4_3.0_1670018595773.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_nl_cased_nl_4.2.4_3.0_1670018595773.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_nl_cased","nl") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_nl_cased","nl")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_nl_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|nl|
|Size:|391.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/bert-base-nl-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Legal Satisfaction And Discharge Of Indenture Clause Binary Classifier
author: John Snow Labs
name: legclf_satisfaction_and_discharge_of_indenture_clause
date: 2023-01-27
tags: [en, legal, classification, satisfaction, discharge, indenture, clauses, satisfaction_and_discharge_of_indenture, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `satisfaction-and-discharge-of-indenture` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`satisfaction-and-discharge-of-indenture`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_satisfaction_and_discharge_of_indenture_clause_en_1.0.0_3.0_1674821475929.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_satisfaction_and_discharge_of_indenture_clause_en_1.0.0_3.0_1674821475929.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
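A minimal usage sketch, assuming the `johnsnowlabs` Python library with an active Legal NLP license; the sentence-embeddings model name (`sent_bert_base_cased`) is an assumption, so check which embeddings this classifier expects before running it.
```python
# Minimal sketch of a Legal NLP clause-classification pipeline.
# Assumptions: the `johnsnowlabs` library with a Legal NLP license,
# and "sent_bert_base_cased" as the sentence-embeddings stage.
from johnsnowlabs import nlp, legal
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
doc_classifier = legal.ClassifierDLModel.pretrained("legclf_satisfaction_and_discharge_of_indenture_clause", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("class")
pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])
df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```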
## Results
```bash
+-----------------------------------------+
|result                                   |
+-----------------------------------------+
|[satisfaction-and-discharge-of-indenture]|
|[other]                                  |
|[other]                                  |
|[satisfaction-and-discharge-of-indenture]|
+-----------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_satisfaction_and_discharge_of_indenture_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.95 0.97 0.96 36
satisfaction-and-discharge-of-indenture 0.97 0.94 0.95 31
accuracy - - 0.96 67
macro-avg 0.96 0.95 0.95 67
weighted-avg 0.96 0.96 0.96 67
```
---
layout: model
title: Resolve Tickers to Company Names
author: John Snow Labs
name: finel_tickers2names
date: 2022-09-09
tags: [en, finance, companies, tickers, nasdaq, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is an Entity Resolution / Entity Linking model, which is able to provide Company Names given their Ticker / Trading Symbols. You can use any NER which extracts Tickers, then send the output to this Entity Linking model to get the Company Name.
## Predicted Entities
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/financial_company_normalization){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finel_tickers2names_en_1.0.0_3.2_1662733866127.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finel_tickers2names_en_1.0.0_3.2_1662733866127.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
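A minimal usage sketch, assuming the `johnsnowlabs` Python library with an active Finance NLP license; the Universal Sentence Encoder stage (`tfhub_use`) is an assumption, so check which sentence embeddings this resolver was trained with before running it.
```python
# Minimal sketch of a ticker-to-company-name resolution pipeline.
# Assumptions: the `johnsnowlabs` library with a Finance NLP license,
# and "tfhub_use" as the sentence-embeddings stage feeding the resolver.
from johnsnowlabs import nlp, finance
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("ner_chunk")
embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
.setInputCols("ner_chunk") \
.setOutputCol("sentence_embeddings")
resolver = finance.SentenceEntityResolverModel.pretrained("finel_tickers2names", "en", "finance/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("name")
pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, resolver])
df = spark.createDataFrame([["unit"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```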
## Results
```bash
+-------+--------------------+-----------------------------------------------------------------+----------------------------------------------------------------+---------------------------+
| chunk| code | all_codes| resolutions | all_distances|
+-------+--------------------+-----------------------------------------------------------------+----------------------------------------------------------------+---------------------------+
| unit | UNITI GROUP INC. | [UNITI GROUP INC., Uniti Group INC. , Uniti Group Incorporated] |[UNITI GROUP INC., Uniti Group INC. , Uniti Group Incorporated] | [0.0000, 0.0000, 0.0000] |
+-------+--------------------+-----------------------------------------------------------------+----------------------------------------------------------------+---------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finel_tickers2names|
|Type:|finance|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[org_company_name]|
|Language:|en|
|Size:|8.5 MB|
|Case sensitive:|false|
## References
https://data.world/johnsnowlabs/list-of-companies-in-nasdaq-exchanges
---
layout: model
title: Pipeline to Detect Clinical Events
author: John Snow Labs
name: ner_events_clinical_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_events_clinical](https://nlp.johnsnowlabs.com/2021/03/31/ner_events_clinical_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_clinical_pipeline_en_3.4.1_3.0_1647873847549.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_clinical_pipeline_en_3.4.1_3.0_1647873847549.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_events_clinical_pipeline", "en", "clinical/models")
pipeline.annotate("The patient presented to the emergency room last evening")
```
```scala
val pipeline = new PretrainedPipeline("ner_events_clinical_pipeline", "en", "clinical/models")
pipeline.annotate("The patient presented to the emergency room last evening")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.events_clinical.pipeline").predict("""The patient presented to the emergency room last evening""")
```
## Results
```bash
+------------------+-------------+
|chunk |ner_label |
+------------------+-------------+
|presented |OCCURRENCE |
|the emergency room|CLINICAL_DEPT|
|last evening |DATE |
+------------------+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_events_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: BERT Sequence Classification - Identify Antisemitic texts
author: John Snow Labs
name: bert_sequence_classifier_antisemitism
date: 2021-11-06
tags: [en, open_source]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 3.3.2
spark_version: 2.4
supported: true
annotator: BertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is imported from `Hugging Face-models` and was trained on 4K tweets, of which ~50% were labeled as antisemitic. The model identifies whether a text is antisemitic or not.
- `1` : Antisemitic
- `0` : Non-antisemitic
## Predicted Entities
`1`, `0`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_antisemitism_en_3.3.2_2.4_1636196636003.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_antisemitism_en_3.3.2_2.4_1636196636003.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
sequenceClassifier = BertForSequenceClassification \
.pretrained('bert_sequence_classifier_antisemitism', 'en') \
.setInputCols(['token', 'document']) \
.setOutputCol('class')
pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])
example = spark.createDataFrame([["The Jews have too much power!"]]).toDF("text")
result = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_antisemitism", "en")
.setInputCols("document", "token")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, sequenceClassifier))
val example = Seq("The Jews have too much power!").toDS.toDF("text")
val result = pipeline.fit(example).transform(example)
```
## Results
```bash
['1']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_antisemitism|
|Compatibility:|Spark NLP 3.3.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[label]|
|Language:|en|
|Case sensitive:|true|
## Data Source
[https://huggingface.co/astarostap/autonlp-antisemitism-2-21194454](https://huggingface.co/astarostap/autonlp-antisemitism-2-21194454)
---
layout: model
title: English DistilBertForTokenClassification Base Uncased model (from Datasaur)
author: John Snow Labs
name: distilbert_token_classifier_base_uncased_finetuned_conll2003
date: 2023-03-03
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-conll2003` is an English model originally trained by `Datasaur`.
## Predicted Entities
`LOC`, `ORG`, `PER`, `MISC`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_finetuned_conll2003_en_4.3.1_3.0_1677881552803.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_finetuned_conll2003_en_4.3.1_3.0_1677881552803.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_finetuned_conll2003","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_finetuned_conll2003","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_base_uncased_finetuned_conll2003|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/Datasaur/distilbert-base-uncased-finetuned-conll2003
---
layout: model
title: English RobertaForQuestionAnswering (from eAsyle)
author: John Snow Labs
name: roberta_qa_roberta_base_custom_QA
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta_base_custom_QA` is an English model originally trained by `eAsyle`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_custom_QA_en_4.0.0_3.0_1655738945694.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_custom_QA_en_4.0.0_3.0_1655738945694.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_custom_QA","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_custom_QA","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_custom_QA|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|424.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/eAsyle/roberta_base_custom_QA
---
layout: model
title: Financial NER (sm, Small)
author: John Snow Labs
name: finner_financial_small
date: 2022-10-19
tags: [en, finance, ner, annual, reports, 10k, filings, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: FinanceNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is the `sm` (small) version of a financial model, trained with more generic labels than the other versions (`md`, `lg`, ...) available in Models Hub.
Please note this model requires some tokenization configuration to extract the currency symbols (see the Python snippet below).
The aim of this model is to detect the main pieces of financial information in annual reports of companies; more specifically, it was trained on 10-K filings.
The currently available entities are:
- AMOUNT: Numeric amounts, not percentages
- PERCENTAGE: Numeric amounts which are percentages
- CURRENCY: The currency of the amount
- FISCAL_YEAR: A date expressing the month in which the fiscal year was closed for a specific year
- DATE: Generic dates in contexts where either it is not a fiscal year or it cannot be asserted as such
- PROFIT: Profit or revenue
- PROFIT_INCREASE: A piece of information saying there was a profit / revenue increase in that fiscal year
- PROFIT_DECLINE: A piece of information saying there was a profit / revenue decrease in that fiscal year
- EXPENSE: An expense or loss
- EXPENSE_INCREASE: A piece of information saying there was an expense increase in that fiscal year
- EXPENSE_DECREASE: A piece of information saying there was an expense decrease in that fiscal year
You can also check the Relation Extraction model, which connects these entities together.
## Predicted Entities
`AMOUNT`, `CURRENCY`, `DATE`, `FISCAL_YEAR`, `PERCENTAGE`, `EXPENSE`, `EXPENSE_INCREASE`, `EXPENSE_DECREASE`, `PROFIT`, `PROFIT_INCREASE`, `PROFIT_DECLINE`
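Because `CURRENCY` and `AMOUNT` are extracted as separate chunks, a common post-processing step is to pair each currency symbol with the amount that immediately follows it. The sketch below is a hypothetical, Spark-free illustration of that pairing, assuming you have already collected the `ner_chunk` output into `(text, label)` tuples in document order (as the snippet below produces):

```python
# Hypothetical post-processing sketch: pair each CURRENCY chunk with the
# AMOUNT chunk that immediately follows it. `chunks` is a list of
# (text, label) tuples collected from the ner_chunk column, in order.
def pair_currency_amounts(chunks):
    pairs = []
    for (text, label), (next_text, next_label) in zip(chunks, chunks[1:]):
        if label == "CURRENCY" and next_label == "AMOUNT":
            pairs.append(f"{text} {next_text}")
    return pairs

# Example chunks, mirroring the sample sentence used further down.
chunks = [
    ("40 %", "PERCENTAGE"),
    ("$", "CURRENCY"),
    ("0.5 million", "AMOUNT"),
    ("$", "CURRENCY"),
    ("0.7 million", "AMOUNT"),
    ("December 31, 2020", "FISCAL_YEAR"),
]
print(pair_currency_amounts(chunks))  # ['$ 0.5 million', '$ 0.7 million']
```

The helper name and the tuple layout are illustrative, not part of the library; with the pipeline below you would build `chunks` from the exploded `ner_chunk.result` / `ner_chunk.metadata` columns first.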
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_FINANCIAL_10K/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_financial_small_en_1.0.0_3.0_1666185056018.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_financial_small_en_1.0.0_3.0_1666185056018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")\
.setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€'])
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)
ner_model = finance.NerModel.pretrained("finner_financial_small", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter
])
data = spark.createDataFrame([["""License fees revenue decreased 40 %, or $ 0.5 million to $ 0.7 million for the year ended December 31, 2020 compared to $ 1.2 million for the year ended December 31, 2019. Services revenue increased 4 %, or $ 1.1 million, to $ 25.6 million for the year ended December 31, 2020 from $ 24.5 million for the year ended December 31, 2019. Costs of revenue, excluding depreciation and amortization increased by $ 0.1 million, or 2 %, to $ 8.8 million for the year ended December 31, 2020 from $ 8.7 million for the year ended December 31, 2019. The increase was primarily related to increase in internal staff costs of $ 1.1 million as we increased delivery staff and work performed on internal projects, partially offset by a decrease in third party consultant costs of $ 0.6 million as these were converted to internal staff or terminated. Also, a decrease in travel costs of $ 0.4 million due to travel restrictions caused by the global pandemic. As a percentage of revenue, cost of revenue, excluding depreciation and amortization was 34 % for each of the years ended December 31, 2020 and 2019. Sales and marketing expenses decreased 20 %, or $ 1.5 million, to $ 6.0 million for the year ended December 31, 2020 from $ 7.5 million for the year ended December 31, 2019."""]]).toDF("text")
model = pipeline.fit(data)
result = model.transform(data)
import pyspark.sql.functions as F
result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
.select(F.expr("cols['0']").alias("text"),
F.expr("cols['1']['entity']").alias("label")).show(200, truncate = False)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_ner_jsl_pipeline", "en", "clinical/models")
text = """The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."""
result = pipeline.fullAnnotate(text)
```
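The Results table further down lists each chunk with its offsets, label, and confidence. A hedged sketch of how you might tabulate `fullAnnotate` output into such rows is shown below; the `Annotation` stand-in class and its attribute names (`result`, `begin`, `end`, `metadata`) are assumptions modeled on the Spark NLP annotation structure, used here only so the example is self-contained:

```python
# Hedged sketch: turn a list of chunk annotations into
# (text, begin, end, label, confidence) rows like the Results table.
def chunks_to_rows(annotations):
    return [
        (a.result, a.begin, a.end, a.metadata["entity"], a.metadata["confidence"])
        for a in annotations
    ]

# Stand-in for the Spark NLP Annotation object, for illustration only.
class Annotation:
    def __init__(self, result, begin, end, metadata):
        self.result, self.begin, self.end, self.metadata = result, begin, end, metadata

rows = chunks_to_rows([
    Annotation("21-day-old", 17, 26, {"entity": "Age", "confidence": "0.9966"}),
    Annotation("male", 38, 41, {"entity": "Gender", "confidence": "0.9998"}),
])
print(rows[0])  # ('21-day-old', 17, 26, 'Age', '0.9966')
```

With the real pipeline you would pass the annotations from the result of `fullAnnotate` (e.g. the `ner_chunk` entries) instead of the hand-built examples above.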
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_ner_jsl_pipeline", "en", "clinical/models")
val text = """The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."""
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunks | begin | end | ner_labels | confidence |
|---:|:------------------------------------------|--------:|------:|:-----------------------------|-------------:|
| 0 | 21-day-old | 17 | 26 | Age | 0.996622 |
| 1 | Caucasian | 28 | 36 | Race_Ethnicity | 0.999759 |
| 2 | male | 38 | 41 | Gender | 0.999847 |
| 3 | 2 days | 52 | 57 | Duration | 0.818646 |
| 4 | congestion | 62 | 71 | Symptom | 0.997344 |
| 5 | mom | 75 | 77 | Gender | 0.999601 |
| 6 | yellow | 99 | 104 | Symptom | 0.476263 |
| 7 | discharge | 106 | 114 | Symptom | 0.704853 |
| 8 | nares | 135 | 139 | External_body_part_or_region | 0.999152 |
| 9 | she | 147 | 149 | Gender | 0.999927 |
| 10 | mild | 168 | 171 | Modifier | 0.999674 |
| 11 | problems with his breathing while feeding | 173 | 213 | Symptom | 0.995353 |
| 12 | perioral cyanosis | 237 | 253 | Symptom | 0.99852 |
| 13 | retractions | 258 | 268 | Symptom | 0.999806 |
| 14 | One day ago | 272 | 282 | RelativeDate | 0.99949 |
| 15 | mom | 285 | 287 | Gender | 0.999779 |
| 16 | tactile temperature | 304 | 322 | Symptom | 0.997475 |
| 17 | Tylenol | 345 | 351 | Drug_BrandName | 0.998978 |
| 18 | Baby-girl | 354 | 362 | Age | 0.990654 |
| 19 | decreased | 382 | 390 | Symptom | 0.996808 |
| 20 | intake | 397 | 402 | Symptom | 0.983608 |
| 21 | His | 405 | 407 | Gender | 0.999922 |
| 22 | breast-feeding | 416 | 429 | External_body_part_or_region | 0.994421 |
| 23 | 20 minutes | 444 | 453 | Duration | 0.992322 |
| 24 | 5 to 10 minutes | 464 | 478 | Duration | 0.969913 |
| 25 | his | 493 | 495 | Gender | 0.999908 |
| 26 | respiratory congestion | 497 | 518 | Symptom | 0.995677 |
| 27 | He | 521 | 522 | Gender | 0.999803 |
| 28 | tired | 555 | 559 | Symptom | 0.999463 |
| 29 | fussy | 574 | 578 | Symptom | 0.996514 |
| 30 | over the past 2 days | 580 | 599 | RelativeDate | 0.998001 |
| 31 | albuterol | 642 | 650 | Drug_Ingredient | 0.99964 |
| 32 | ER | 676 | 677 | Clinical_Dept | 0.998161 |
| 33 | His | 680 | 682 | Gender | 0.999921 |
| 34 | urine output has also decreased | 684 | 714 | Symptom | 0.971606 |
| 35 | he | 726 | 727 | Gender | 0.999916 |
| 36 | per 24 hours | 765 | 776 | Frequency | 0.910935 |
| 37 | he | 783 | 784 | Gender | 0.999922 |
| 38 | per 24 hours | 812 | 823 | Frequency | 0.921849 |
| 39 | Mom | 826 | 828 | Gender | 0.999606 |
| 40 | diarrhea | 841 | 848 | Symptom | 0.999849 |
| 41 | His | 851 | 853 | Gender | 0.999739 |
| 42 | bowel | 855 | 859 | Internal_organ_or_component | 0.999471 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_jsl_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|405.4 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverterInternalModel
---
layout: model
title: Brazilian Portuguese NER for Laws (Bert, Base)
author: John Snow Labs
name: legner_br_bert_base
date: 2022-09-28
tags: [pt, legal, ner, laws, licensed]
task: Named Entity Recognition
language: pt
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Deep Learning Portuguese Named Entity Recognition model for the legal domain, trained using Base Bert Embeddings, and is able to predict the following entities:
- ORGANIZACAO (Organizations)
- JURISPRUDENCIA (Jurisprudence)
- PESSOA (Person)
- TEMPO (Time)
- LOCAL (Location)
- LEGISLACAO (Laws)
- O (Other)
You can find different versions of this model in Models Hub:
- With a Deep Learning architecture (non-transformer) and Base Embeddings;
- With a Deep Learning architecture (non-transformer) and Large Embeddings;
- With a Transformers Architecture and Base Embeddings;
- With a Transformers Architecture and Large Embeddings;
## Predicted Entities
`PESSOA`, `ORGANIZACAO`, `LEGISLACAO`, `JURISPRUDENCIA`, `TEMPO`, `LOCAL`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_LEGAL_PT/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_br_bert_base_pt_1.0.0_3.0_1664362186486.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_br_bert_base_pt_1.0.0_3.0_1664362186486.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = nlp.BertForTokenClassification.pretrained("legner_br_bert_base","pt", "legal/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = nlp.Pipeline(
stages=[
documentAssembler,
sentenceDetector,
tokenizer,
tokenClassifier])
example = spark.createDataFrame(pd.DataFrame({'text': ["""Mediante do exposto , com fundamento nos artigos 32 , i , e 33 , da lei 8.443/1992 , submetem-se os autos à consideração superior , com posterior encaminhamento ao ministério público junto ao tcu e ao gabinete do relator , propondo : a ) conhecer do recurso e , no mérito , negar-lhe provimento ; b ) comunicar ao recorrente , ao superior tribunal militar e ao tribunal regional federal da 2ª região , a fim de fornecer subsídios para os processos judiciais 2001.34.00.024796-9 e 2003.34.00.044227-3 ; e aos demais interessados a deliberação que vier a ser proferida por esta corte ” ."""]}))
result = pipeline.fit(example).transform(example)
```
## Results
```bash
+-------+
|result|
+-------+
|[Organisation_of_Transport]|
|[Other]|
|[Other]|
|[Organisation_of_Transport]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_organisation_of_transport_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.6 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Organisation_of_Transport 0.87 0.90 0.88 188
Other 0.89 0.86 0.88 184
accuracy - - 0.88 372
macro-avg 0.88 0.88 0.88 372
weighted-avg 0.88 0.88 0.88 372
```
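The macro average in the table above is simply the unweighted mean of the per-class scores; a quick arithmetic check:

```python
# Verify the reported macro-averaged F1: the unweighted mean of the
# per-class F1 scores from the benchmarking table.
per_class_f1 = {"Organisation_of_Transport": 0.88, "Other": 0.88}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 2))  # 0.88
```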
---
layout: model
title: English BertForQuestionAnswering Cased model (from roshnir)
author: John Snow Labs
name: bert_qa_mbert_finetuned_mlqa_dev
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mBert-finetuned-mlqa-dev-en` is an English model originally trained by `roshnir`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_dev_en_4.0.0_3.0_1657189939497.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_dev_en_4.0.0_3.0_1657189939497.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_dev","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_dev","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_mbert_finetuned_mlqa_dev|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|626.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/roshnir/mBert-finetuned-mlqa-dev-en
---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC4CHEMD_ImbalancedPubMedBERT
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD_ImbalancedPubMedBERT` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities
`Chemical`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_ImbalancedPubMedBERT_en_4.0.0_3.0_1657108954704.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_ImbalancedPubMedBERT_en_4.0.0_3.0_1657108954704.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_ImbalancedPubMedBERT","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_ImbalancedPubMedBERT","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_BC4CHEMD_ImbalancedPubMedBERT|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|408.7 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ghadeermobasher/BC4CHEMD_ImbalancedPubMedBERT
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from ncduy)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-squad2-distilled-finetuned-chaii` is an English model originally trained by `ncduy`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii_en_4.0.0_3.0_1655991564185.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii_en_4.0.0_3.0_1655991564185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_chaii.xlm_roberta.distilled_base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|886.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ncduy/xlm-roberta-base-squad2-distilled-finetuned-chaii
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from mrm8488)
author: John Snow Labs
name: t5_base_finetuned_wikisql
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-finetuned-wikiSQL` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_wikisql_en_4.3.0_3.0_1675109286457.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_wikisql_en_4.3.0_3.0_1675109286457.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_base_finetuned_wikisql","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_base_finetuned_wikisql","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_base_finetuned_wikisql|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|887.6 MB|
## References
- https://huggingface.co/mrm8488/t5-base-finetuned-wikiSQL
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://github.com/salesforce/WikiSQL
- https://arxiv.org/pdf/1910.10683.pdf
- https://i.imgur.com/jVFMMWR.png
- https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb
- https://github.com/patil-suraj
- https://pbs.twimg.com/media/Ec5vaG5XsAINty_?format=png&name=900x900
- https://twitter.com/mrm8488
- https://www.linkedin.com/in/manuel-romero-cs/
---
layout: model
title: Fast Neural Machine Translation Model from Thai to English
author: John Snow Labs
name: opus_mt_th_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, th, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `th`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_th_en_xx_2.7.0_2.4_1609163813254.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_th_en_xx_2.7.0_2.4_1609163813254.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_th_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_th_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.th.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_th_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Translate Pangasinan to English Pipeline
author: John Snow Labs
name: translate_pag_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, pag, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `pag`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_pag_en_xx_2.7.0_2.4_1609686426766.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_pag_en_xx_2.7.0_2.4_1609686426766.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_pag_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_pag_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.pag.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_pag_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from twmkn9)
author: John Snow Labs
name: roberta_qa_base_squad2
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base-squad2` is an English model originally trained by `twmkn9`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_en_4.3.0_3.0_1674210478798.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_en_4.3.0_3.0_1674210478798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
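Under the hood, a span-extraction QA head like this one scores every token as a candidate answer start and end, then picks the best valid (start, end) pair. A minimal, library-free sketch of that selection step, using made-up logits (not the model's actual outputs):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick the (start, end) pair with the highest combined score,
    subject to start <= end and a maximum span length."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy logits: token 3 is the likeliest start, token 4 the likeliest end.
start_logits = [0.1, 0.2, 0.1, 2.5, 0.3]
end_logits = [0.1, 0.1, 0.2, 0.4, 2.8]
print(best_span(start_logits, end_logits))  # -> (3, 4)
```

The answer text is then recovered by mapping the winning token span back to character offsets in the context, which the annotator handles internally.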
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_squad2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|307.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/twmkn9/distilroberta-base-squad2
---
layout: model
title: Finnish asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot TFWav2Vec2ForCTC from aapot
author: John Snow Labs
name: pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot
date: 2022-09-24
tags: [wav2vec2, fi, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot` is a Finnish model originally trained by aapot.
NOTE: This pipeline only works on a CPU. If you need to run it on a GPU device, please use pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot_fi_4.2.0_3.0_1664024597770.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot_fi_4.2.0_3.0_1664024597770.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot', lang = 'fi')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot", lang = "fi")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|fi|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English BertForQuestionAnswering model (from Graphcore)
author: John Snow Labs
name: bert_qa_Graphcore_bert_large_uncased_squad
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-squad` is an English model originally trained by `Graphcore`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Graphcore_bert_large_uncased_squad_en_4.0.0_3.0_1654536525530.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Graphcore_bert_large_uncased_squad_en_4.0.0_3.0_1654536525530.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Graphcore_bert_large_uncased_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_Graphcore_bert_large_uncased_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.large_uncased.by_Graphcore").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_Graphcore_bert_large_uncased_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|798.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Graphcore/bert-large-uncased-squad
---
layout: model
title: Italian T5ForConditionalGeneration Small Cased model (from it5)
author: John Snow Labs
name: t5_it5_efficient_small_el32_question_generation
date: 2023-01-30
tags: [it, open_source, t5, tensorflow]
task: Text Generation
language: it
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-efficient-small-el32-question-generation` is an Italian model originally trained by `it5`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_question_generation_it_4.3.0_3.0_1675103595829.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_question_generation_it_4.3.0_3.0_1675103595829.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_question_generation","it") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_question_generation","it")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_it5_efficient_small_el32_question_generation|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|it|
|Size:|593.8 MB|
## References
- https://huggingface.co/it5/it5-efficient-small-el32-question-generation
- https://github.com/stefan-it
- https://arxiv.org/abs/2203.03759
- https://gsarti.com
- https://malvinanissim.github.io
- https://arxiv.org/abs/2109.10686
- https://github.com/gsarti/it5
- https://paperswithcode.com/sota?task=Question+generation&dataset=SQuAD-IT
---
layout: model
title: English DistilBertForQuestionAnswering model (from Adrian) Squad2
author: John Snow Labs
name: distilbert_qa_base_uncased_finetuned_squad_colab
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad-colab` is an English model originally trained by `Adrian`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_colab_en_4.0.0_3.0_1654726625385.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_colab_en_4.0.0_3.0_1654726625385.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_colab","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_colab","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased_colab.by_Adrian").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_finetuned_squad_colab|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Adrian/distilbert-base-uncased-finetuned-squad-colab
---
layout: model
title: Swedish asr_wav2vec2_swedish_common_voice TFWav2Vec2ForCTC from birgermoell
author: John Snow Labs
name: asr_wav2vec2_swedish_common_voice
date: 2022-09-25
tags: [wav2vec2, sv, audio, open_source, asr]
task: Automatic Speech Recognition
language: sv
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_swedish_common_voice` is a Swedish model originally trained by birgermoell.
NOTE: This model only works on a CPU. If you need to run it on a GPU device, please use asr_wav2vec2_swedish_common_voice_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_swedish_common_voice_sv_4.2.0_3.0_1664114373826.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_swedish_common_voice_sv_4.2.0_3.0_1664114373826.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_swedish_common_voice", "sv")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_swedish_common_voice", "sv")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
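A Wav2Vec2ForCTC model emits one character distribution per audio frame, and the transcript is obtained by greedy CTC decoding: collapse consecutive repeated predictions, then drop the CTC blank symbol. A minimal illustration with a toy vocabulary (not the model's actual one):

```python
def ctc_greedy_decode(frame_ids, vocab, blank_id=0):
    """Collapse repeated frame predictions, then drop blanks --
    standard greedy CTC decoding."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(vocab[i])
        prev = i
    return "".join(out)

vocab = {0: "<blank>", 1: "h", 2: "e", 3: "j"}  # toy vocabulary
frames = [1, 1, 0, 2, 2, 2, 0, 3, 3]            # per-frame argmax ids
print(ctc_greedy_decode(frames, vocab))          # -> "hej"
```

The blank symbol is what lets CTC represent genuinely doubled letters: a blank between two identical ids prevents them from being collapsed into one.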
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_swedish_common_voice|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|sv|
|Size:|1.2 GB|
---
layout: model
title: Smaller BERT Embeddings (L-6_H-128_A-2)
author: John Snow Labs
name: small_bert_L6_128
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L6_128_en_2.6.0_2.4_1598344340449.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L6_128_en_2.6.0_2.4_1598344340449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("small_bert_L6_128", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("small_bert_L6_128", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.bert.small_L6_128').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_bert_small_L6_128_embeddings
I [0.43105611205101013, 0.6831966638565063, -1.2.....
love [0.8754201531410217, 0.4752326011657715, -1.46...
NLP [-0.2781177759170532, -0.14001458883285522, 1...
```
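Embedding vectors like the ones above are typically compared with cosine similarity; semantically related tokens point in similar directions. A minimal, library-free sketch using toy 3-dimensional vectors standing in for the model's 128-dimensional embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors for illustration only.
love = [0.87, 0.47, -1.4]
like = [0.80, 0.50, -1.3]
nlp = [-0.27, -0.14, 1.1]
print(round(cosine(love, like), 3))  # near 1.0: similar direction
print(round(cosine(love, nlp), 3))   # negative: dissimilar direction
```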
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|small_bert_L6_128|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|en|
|Dimension:|128|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from [https://tfhub.dev/google/small_bert/bert_uncased_L-6_H-128_A-2/2](https://tfhub.dev/google/small_bert/bert_uncased_L-6_H-128_A-2/2)
---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC5CDR_Chem_Modified_PubMedBERT_384
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC5CDR-Chem-Modified-PubMedBERT-384` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities
`Chemical`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_PubMedBERT_384_en_4.0.0_3.0_1657109392849.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_PubMedBERT_384_en_4.0.0_3.0_1657109392849.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_PubMedBERT_384","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_PubMedBERT_384","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_BC5CDR_Chem_Modified_PubMedBERT_384|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|408.7 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ghadeermobasher/BC5CDR-Chem-Modified-PubMedBERT-384
---
layout: model
title: Detect Persons, Locations, Organizations and Misc Entities in Portuguese (WikiNER 6B 100)
author: John Snow Labs
name: wikiner_6B_100
date: 2020-05-10
task: Named Entity Recognition
language: pt
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [ner, pt, open_source]
supported: true
annotator: NerDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 6B 100 is trained with GloVe 6B 100 word embeddings, so be sure to use the same embeddings in the pipeline.
{:.h2_title}
## Predicted Entities
Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_PT){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_PT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_100_pt_2.5.0_2.4_1588495233192.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_100_pt_2.5.0_2.4_1588495233192.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
ner_model = NerDLModel.pretrained("wikiner_6B_100", "pt") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (nascido em 28 de outubro de 1955) é um magnata americano de negócios, desenvolvedor de software, investidor e filantropo. Ele é mais conhecido como co-fundador da Microsoft Corporation. Durante sua carreira na Microsoft, Gates ocupou os cargos de presidente, diretor executivo (CEO), presidente e diretor de arquitetura de software, além de ser o maior acionista individual até maio de 2014. Ele é um dos empreendedores e pioneiros mais conhecidos da revolução dos microcomputadores nas décadas de 1970 e 1980. Nascido e criado em Seattle, Washington, Gates co-fundou a Microsoft com o amigo de infância Paul Allen em 1975, em Albuquerque, Novo México; tornou-se a maior empresa de software de computador pessoal do mundo. Gates liderou a empresa como presidente e CEO até deixar o cargo em janeiro de 2000, mas ele permaneceu como presidente e tornou-se arquiteto-chefe de software. No final dos anos 90, Gates foi criticado por suas táticas de negócios, que foram consideradas anticompetitivas. Esta opinião foi confirmada por várias decisões judiciais. Em junho de 2006, Gates anunciou que iria passar para um cargo de meio período na Microsoft e trabalhar em período integral na Fundação Bill & Melinda Gates, a fundação de caridade privada que ele e sua esposa, Melinda Gates, estabeleceram em 2000. Ele gradualmente transferiu seus deveres para Ray Ozzie e Craig Mundie. Ele deixou o cargo de presidente da Microsoft em fevereiro de 2014 e assumiu um novo cargo como consultor de tecnologia para apoiar a recém-nomeada CEO Satya Nadella.']], ["text"]))
```
```scala
...
val embeddings = WordEmbeddingsModel.pretrained("glove_100d")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("wikiner_6B_100", "pt")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("William Henry Gates III (nascido em 28 de outubro de 1955) é um magnata americano de negócios, desenvolvedor de software, investidor e filantropo. Ele é mais conhecido como co-fundador da Microsoft Corporation. Durante sua carreira na Microsoft, Gates ocupou os cargos de presidente, diretor executivo (CEO), presidente e diretor de arquitetura de software, além de ser o maior acionista individual até maio de 2014. Ele é um dos empreendedores e pioneiros mais conhecidos da revolução dos microcomputadores nas décadas de 1970 e 1980. Nascido e criado em Seattle, Washington, Gates co-fundou a Microsoft com o amigo de infância Paul Allen em 1975, em Albuquerque, Novo México; tornou-se a maior empresa de software de computador pessoal do mundo. Gates liderou a empresa como presidente e CEO até deixar o cargo em janeiro de 2000, mas ele permaneceu como presidente e tornou-se arquiteto-chefe de software. No final dos anos 90, Gates foi criticado por suas táticas de negócios, que foram consideradas anticompetitivas. Esta opinião foi confirmada por várias decisões judiciais. Em junho de 2006, Gates anunciou que iria passar para um cargo de meio período na Microsoft e trabalhar em período integral na Fundação Bill & Melinda Gates, a fundação de caridade privada que ele e sua esposa, Melinda Gates, estabeleceram em 2000. Ele gradualmente transferiu seus deveres para Ray Ozzie e Craig Mundie. Ele deixou o cargo de presidente da Microsoft em fevereiro de 2014 e assumiu um novo cargo como consultor de tecnologia para apoiar a recém-nomeada CEO Satya Nadella.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""William Henry Gates III (nascido em 28 de outubro de 1955) é um magnata americano de negócios, desenvolvedor de software, investidor e filantropo. Ele é mais conhecido como co-fundador da Microsoft Corporation. Durante sua carreira na Microsoft, Gates ocupou os cargos de presidente, diretor executivo (CEO), presidente e diretor de arquitetura de software, além de ser o maior acionista individual até maio de 2014. Ele é um dos empreendedores e pioneiros mais conhecidos da revolução dos microcomputadores nas décadas de 1970 e 1980. Nascido e criado em Seattle, Washington, Gates co-fundou a Microsoft com o amigo de infância Paul Allen em 1975, em Albuquerque, Novo México; tornou-se a maior empresa de software de computador pessoal do mundo. Gates liderou a empresa como presidente e CEO até deixar o cargo em janeiro de 2000, mas ele permaneceu como presidente e tornou-se arquiteto-chefe de software. No final dos anos 90, Gates foi criticado por suas táticas de negócios, que foram consideradas anticompetitivas. Esta opinião foi confirmada por várias decisões judiciais. Em junho de 2006, Gates anunciou que iria passar para um cargo de meio período na Microsoft e trabalhar em período integral na Fundação Bill & Melinda Gates, a fundação de caridade privada que ele e sua esposa, Melinda Gates, estabeleceram em 2000. Ele gradualmente transferiu seus deveres para Ray Ozzie e Craig Mundie. Ele deixou o cargo de presidente da Microsoft em fevereiro de 2014 e assumiu um novo cargo como consultor de tecnologia para apoiar a recém-nomeada CEO Satya Nadella."""]
ner_df = nlu.load('pt.ner.wikiner.glove.6B_100').predict(text, output_level = "chunk")
ner_df[["entities", "entities_confidence"]]
```
{:.h2_title}
## Results
```bash
+-----------------------+---------+
|chunk |ner_label|
+-----------------------+---------+
|William Henry Gates III|PER |
|Ele |MISC |
|Microsoft Corporation |ORG |
|Durante |ORG |
|Microsoft |ORG |
|Gates |PER |
|CEO |ORG |
|Ele |MISC |
|Nascido |MISC |
|Seattle |LOC |
|Washington |LOC |
|Gates |PER |
|Microsoft |ORG |
|Paul Allen |PER |
|Albuquerque |LOC |
|Novo México |LOC |
|Gates |PER |
|CEO |ORG |
|Gates |PER |
|Gates |PER |
+-----------------------+---------+
```
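The chunks in the table above are produced by merging token-level IOB tags into spans (the job NerConverter performs inside the pipeline). A minimal sketch of that merging logic, independent of Spark NLP:

```python
def iob_to_chunks(tokens, tags):
    """Group consecutive B-/I- tags of the same entity type into chunks."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            current.append(tok)
        else:  # "O" tag, or an I- tag that does not continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["William", "Henry", "Gates", "III", "nasceu", "em", "Seattle"]
tags = ["B-PER", "I-PER", "I-PER", "I-PER", "O", "O", "B-LOC"]
print(iob_to_chunks(tokens, tags))
# -> [('William Henry Gates III', 'PER'), ('Seattle', 'LOC')]
```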
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|wikiner_6B_100|
|Type:|ner|
|Compatibility:|Spark NLP 2.5.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|pt|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model was trained based on data from [https://pt.wikipedia.org](https://pt.wikipedia.org)
---
layout: model
title: French CamemBert Embeddings (from Jodsa)
author: John Snow Labs
name: camembert_embeddings_camembert_mlm
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `camembert_mlm` is a French model originally trained by `Jodsa`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_camembert_mlm_fr_3.4.4_3.0_1653985748924.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_camembert_mlm_fr_3.4.4_3.0_1653985748924.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_camembert_mlm","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_camembert_mlm","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_camembert_mlm|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|420.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Jodsa/camembert_mlm
---
layout: model
title: Legal Supplemental Indenture Document Binary Classifier (Longformer)
author: John Snow Labs
name: legclf_supplemental_indenture_agreement
date: 2022-12-18
tags: [en, legal, classification, licensed, document, longformer, supplemental, indenture, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_supplemental_indenture_agreement` model is a Longformer Document Classifier used to classify whether a document belongs to the class `supplemental-indenture` or not (Binary Classification).
Longformers are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. We have found that, for the vast majority of documents in legal corpora, 4096 tokens are enough to perform Document Classification, provided the documents are clean and contain only the legal text without extra leading material.
If your documents are longer than 4096 tokens, you can try the following: split them into chunks of 4096 tokens, average the chunk embeddings, and train on the averaged version, which means the whole document will be taken into account.
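That chunk-and-average strategy can be sketched in plain Python; `embed` below is a hypothetical stand-in for any encoder limited to 4096 tokens, not the actual Longformer implementation:

```python
import numpy as np

MAX_LEN = 4096  # Longformer token limit

def embed(chunk_tokens):
    # Hypothetical encoder stub: a real pipeline would return the
    # document embedding produced by the model for this chunk.
    rng = np.random.default_rng(len(chunk_tokens))
    return rng.standard_normal(8)

def document_embedding(tokens):
    # Split the token sequence into 4096-token chunks and average their
    # embeddings, so the whole document contributes to the final vector.
    chunks = [tokens[i:i + MAX_LEN] for i in range(0, len(tokens), MAX_LEN)]
    return np.mean([embed(c) for c in chunks], axis=0)

doc = ["tok"] * 10000          # a document longer than 4096 tokens
vec = document_embedding(doc)  # averaged over chunks of 4096, 4096, 1808 tokens
```

The classifier can then be trained on these averaged vectors instead of truncated single-chunk embeddings.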
## Predicted Entities
`supplemental-indenture`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_supplemental_indenture_agreement_en_1.0.0_3.0_1671393678095.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_supplemental_indenture_agreement_en_1.0.0_3.0_1671393678095.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+------------------------+
|result                  |
+------------------------+
|[supplemental-indenture]|
|[other]                 |
|[other]                 |
|[supplemental-indenture]|
+------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_supplemental_indenture_agreement|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.6 MB|
## References
Legal documents, scraped from the Internet and classified in-house, plus SEC documents.
## Benchmarking
```bash
label precision recall f1-score support
other 0.97 0.95 0.96 221
supplemental-indenture 0.90 0.94 0.92 107
accuracy - - 0.95 328
macro-avg 0.94 0.95 0.94 328
weighted-avg 0.95 0.95 0.95 328
```
---
layout: model
title: Relation Extraction between Test and Results (ReDL)
author: John Snow Labs
name: redl_oncology_test_result_biobert_wip
date: 2023-01-15
tags: [licensed, clinical, oncology, en, relation_extraction, test, tensorflow]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This relation extraction model links test extractions to their corresponding results.
## Predicted Entities
`is_finding_of`, `O`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_test_result_biobert_wip_en_4.2.4_3.0_1673776756086.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_test_result_biobert_wip_en_4.2.4_3.0_1673776756086.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Each relevant relation pair in the pipeline should include one test entity (such as Biomarker, Imaging_Test, Pathology_Test or Oncogene) and one result entity (such as Biomarker_Result, Pathology_Result or Tumor_Finding).
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos_tags")
dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \
.setInputCols(["sentence", "pos_tags", "token"]) \
.setOutputCol("dependencies")
re_ner_chunk_filter = RENerChunksFilter()\
.setInputCols(["ner_chunk", "dependencies"])\
.setOutputCol("re_ner_chunk")\
.setMaxSyntacticDistance(10)\
.setRelationPairs(["Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene", "Pathology_Test-Pathology_Result", "Pathology_Result-Pathology_Test"])
re_model = RelationExtractionDLModel.pretrained("redl_oncology_test_result_biobert_wip", "en", "clinical/models")\
.setInputCols(["re_ner_chunk", "sentence"])\
.setOutputCol("relation_extraction")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
pos_tagger,
dependency_parser,
re_ner_chunk_filter,
re_model])
data = spark.createDataFrame([["Pathology showed tumor cells, which were positive for estrogen and progesterone receptors."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos_tags")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentence", "pos_tags", "token"))
.setOutputCol("dependencies")
val re_ner_chunk_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunk", "dependencies"))
.setOutputCol("re_ner_chunk")
.setMaxSyntacticDistance(10)
.setRelationPairs(Array("Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene", "Pathology_Test-Pathology_Result", "Pathology_Result-Pathology_Test"))
val re_model = RelationExtractionDLModel.pretrained("redl_oncology_test_result_biobert_wip", "en", "clinical/models")
.setPredictionThreshold(0.5f)
.setInputCols(Array("re_ner_chunk", "sentence"))
.setOutputCol("relation_extraction")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
pos_tagger,
dependency_parser,
re_ner_chunk_filter,
re_model))
val data = Seq("Pathology showed tumor cells, which were positive for estrogen and progesterone receptors.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.oncology.test_result_biobert").predict("""Pathology showed tumor cells, which were positive for estrogen and progesterone receptors.""")
```
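The relation-pair constraint described above (one test entity plus one result entity) can be sketched in plain Python; `candidate_pairs` is a hypothetical illustration of the filtering idea, not the actual `RENerChunksFilter` implementation:

```python
# Only ordered (test, result) label combinations listed as relation pairs
# are kept as candidates for the relation extraction model.
ALLOWED_PAIRS = {
    ("Biomarker", "Biomarker_Result"), ("Biomarker_Result", "Biomarker"),
    ("Oncogene", "Biomarker_Result"), ("Biomarker_Result", "Oncogene"),
    ("Pathology_Test", "Pathology_Result"), ("Pathology_Result", "Pathology_Test"),
}

def candidate_pairs(chunks):
    # chunks: list of (text, entity_label) tuples from the NER converter.
    # Returns every ordered pair whose labels form an allowed combination.
    return [(a, b) for a in chunks for b in chunks
            if a is not b and (a[1], b[1]) in ALLOWED_PAIRS]

chunks = [("Pathology", "Pathology_Test"),
          ("tumor cells", "Pathology_Result"),
          ("positive", "Biomarker_Result"),
          ("estrogen", "Biomarker")]
pairs = candidate_pairs(chunks)
```

Pairs such as (`Pathology_Test`, `Biomarker_Result`) are never scored, which is why the filter stage matters for precision.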
## Results
```bash
+-------------+----------------+-------------+-----------+---------+----------------+-------------+-----------+--------------------+----------+
| relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence|
+-------------+----------------+-------------+-----------+---------+----------------+-------------+-----------+--------------------+----------+
|is_finding_of| Pathology_Test| 0| 8|Pathology|Pathology_Result| 17| 27| tumor cells| 0.8494344|
|is_finding_of|Biomarker_Result| 41| 48| positive| Biomarker| 54| 61| estrogen|0.99451536|
|is_finding_of|Biomarker_Result| 41| 48| positive| Biomarker| 67| 88|progesterone rece...|0.99218905|
+-------------+----------------+-------------+-----------+---------+----------------+-------------+-----------+--------------------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|redl_oncology_test_result_biobert_wip|
|Compatibility:|Healthcare NLP 4.2.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|401.7 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label recall precision f1
O 0.87 0.92 0.9
is_finding_of 0.93 0.88 0.9
macro-avg 0.90 0.90 0.9
```
---
layout: model
title: English image_classifier_vit_diam ViTForImageClassification from godiec
author: John Snow Labs
name: image_classifier_vit_diam
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_diam` is an English model originally trained by godiec.
## Predicted Entities
`bunny`, `moon`, `sun`, `tiger`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_diam_en_4.1.0_3.0_1660167848550.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_diam_en_4.1.0_3.0_1660167848550.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_diam", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_diam", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_diam|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: BERT Sentence Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on SQuAD 2.0
author: John Snow Labs
name: sent_bert_wiki_books_squad2
date: 2021-08-31
tags: [en, open_source, sentence_detection, wikipedia_dataset, books_corpus_dataset, squad_2_dataset]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.2.0
spark_version: 3.0
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/wiki_books/1 and fine-tuned on SQuAD 2.0. This is a BERT base architecture but some changes have been made to the original training and export scheme based on more recent learnings.
This model is intended to be used for a variety of English NLP tasks. The pre-training data contains more formal text and the model may not generalize to more colloquial text such as social media or messages.
This model is fine-tuned on SQuAD 2.0 and is recommended for use in question answering tasks. The fine-tuning task uses the SQuAD 2.0 dataset as a span-labeling task, labeling the answer to a question in a given context.
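As a rough illustration of what span labeling means here, the sketch below picks the best (start, end) token span from per-token logits; this is a simplified stand-in for SQuAD-style answer selection, not the internals of this model:

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=30):
    # Span labeling: choose the (start, end) pair maximizing the sum of
    # start and end logits, with end >= start and a bounded span length.
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

start = np.array([0.1, 2.0, 0.3, 0.0])
end = np.array([0.0, 0.2, 1.5, 0.1])
span = best_span(start, end)  # the answer spans tokens 1..2
```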
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_squad2_en_3.2.0_3.0_1630412125790.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_squad2_en_3.2.0_3.0_1630412125790.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books_squad2", "en") \
.setInputCols("sentence") \
.setOutputCol("bert_sentence")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings ])
```
```scala
val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books_squad2", "en")
.setInputCols("sentence")
.setOutputCol("bert_sentence")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings ))
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
sent_embeddings_df = nlu.load('en.embed_sentence.bert.wiki_books_squad2').predict(text, output_level='sentence')
sent_embeddings_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_bert_wiki_books_squad2|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[bert_sentence]|
|Language:|en|
|Case sensitive:|false|
## Data Source
[1]: [Wikipedia dataset](https://dumps.wikimedia.org/)
[2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb)
[3]: [Stanford Question Answering (SQuAD 2.0) dataset](https://rajpurkar.github.io/SQuAD-explorer/)
This model has been imported from: https://tfhub.dev/google/experts/bert/wiki_books/squad2/2
---
layout: model
title: Classify text about Effective, Renewal or Termination date
author: John Snow Labs
name: legclf_dates_sm
date: 2022-11-21
tags: [effective, renewal, termination, date, en, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Text Classification model can help you classify whether a paragraph talks about an Effective Date, a Renewal Date, a Termination Date, or something else. Don't confuse this model with the NER model (`legner_dates_sm`), which allows you to extract the actual dates from the texts.
## Predicted Entities
`EFFECTIVE_DATE`, `RENEWAL_DATE`, `TERMINATION_DATE`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_dates_sm_en_1.0.0_3.0_1669034322560.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_dates_sm_en_1.0.0_3.0_1669034322560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
docClassifier = legal.ClassifierDLModel.pretrained('legclf_dates_sm', 'en', 'legal/models')\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("label")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
embeddings,
docClassifier])
text = ["""Renewal Date means January 1, 2018."""]
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
res = model.transform(spark.createDataFrame([text]).toDF("text"))
```
## Results
```bash
+--------------+
| result|
+--------------+
|[RENEWAL_DATE]|
+--------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_dates_sm|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[label]|
|Language:|en|
|Size:|22.5 MB|
## References
In-house annotations.
## Benchmarking
```bash
label precision recall f1-score support
EFFECTIVE_DATE 1.00 0.80 0.89 5
RENEWAL_DATE 1.00 1.00 1.00 6
TERMINATION_DATE 0.86 0.75 0.80 8
other 0.91 1.00 0.95 21
accuracy - - 0.93 40
macro-avg 0.94 0.89 0.91 40
weighted-avg 0.93 0.93 0.92 40
```
---
layout: model
title: Legal Termination Clause Binary Classifier (CUAD dataset, SBERT version)
author: John Snow Labs
name: legclf_sbert_cuad_termination_clause
date: 2022-11-11
tags: [termination, en, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `termination` clause type. To use this model, make sure you provide enough context as an input. Adding a Sentence Splitter to the pipeline would make the model see only individual sentences rather than the whole text, so it is better to skip it unless you want to do Binary Classification at sentence level.
This version was trained with Sentence BERT (SBERT) embeddings. There is another version trained with the Universal Sentence Encoder, called `legclf_cuad_termination_clause`.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
There are other models in Models Hub with similar titles; the difference is the dataset they were trained on. This one was trained on the `cuad` dataset.
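A minimal sketch of that combination pattern, with stubbed keyword classifiers standing in for the pretrained `legclf_*` models:

```python
# Stubbed binary clause classifiers: each maps a clause text to True/False.
# In a real pipeline these would be pretrained legclf_* models, not keywords.
def looks_like_termination(text):
    return "terminat" in text.lower()

def looks_like_renewal(text):
    return "renew" in text.lower()

CLASSIFIERS = {
    "termination": looks_like_termination,
    "renewal": looks_like_renewal,
}

def classify_clause(text):
    # Run every binary classifier and collect one True/False flag per clause type.
    return {name: clf(text) for name, clf in CLASSIFIERS.items()}

flags = classify_clause("This Agreement may be terminated immediately by Developer.")
```

Each added classifier contributes one more flag, so a single pass over a clause yields its full clause-type profile.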
## Predicted Entities
`termination`, `other`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_sbert_cuad_termination_clause_en_1.0.0_3.0_1668163200458.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_sbert_cuad_termination_clause_en_1.0.0_3.0_1668163200458.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("clause_text") \
.setOutputCol("document")
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
docClassifier = nlp.ClassifierDLModel.pretrained("legclf_sbert_cuad_termination_clause", "en", "legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
embeddings,
docClassifier])
df = spark.createDataFrame([[" ---------------------\n\n This Agreement may be terminated immediately by Developer..."]]).toDF("clause_text")
model = nlpPipeline.fit(df)
result = model.transform(df)
```
## Results
```bash
+-------------+
|result       |
+-------------+
|[termination]|
+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_sbert_cuad_termination_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|
## References
In-house annotations on CUAD dataset
## Benchmarking
```bash
label precision recall f1-score support
other 1.00 1.00 1.00 41
termination 1.00 1.00 1.00 40
accuracy - - 1.00 81
macro-avg 1.00 1.00 1.00 81
weighted-avg 1.00 1.00 1.00 81
```
---
layout: model
title: German asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350 TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350` is a German model originally trained by jonatasgrosman.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350_de_4.2.0_3.0_1664103295967.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350_de_4.2.0_3.0_1664103295967.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350', lang = 'de')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350", lang = "de")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|de|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English BertForQuestionAnswering model (from sunitha)
author: John Snow Labs
name: bert_qa_output_files
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `output_files` is an English model originally trained by `sunitha`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_output_files_en_4.0.0_3.0_1654189020709.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_output_files_en_4.0.0_3.0_1654189020709.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_output_files","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_output_files","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.output_files.bert.by_sunitha").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_output_files|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/sunitha/output_files
---
layout: model
title: English image_classifier_vit_ViT_FaceMask_Finetuned ViTForImageClassification from AkshatSurolia
author: John Snow Labs
name: image_classifier_vit_ViT_FaceMask_Finetuned
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_ViT_FaceMask_Finetuned` is an English model originally trained by AkshatSurolia.
## Predicted Entities
`Mask`, `No Mask`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ViT_FaceMask_Finetuned_en_4.1.0_3.0_1660165872491.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ViT_FaceMask_Finetuned_en_4.1.0_3.0_1660165872491.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_ViT_FaceMask_Finetuned", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_ViT_FaceMask_Finetuned", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_ViT_FaceMask_Finetuned|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: English DistilBertForQuestionAnswering model (from bdickson)
author: John Snow Labs
name: distilbert_qa_bdickson_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `bdickson`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_bdickson_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725099987.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_bdickson_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725099987.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bdickson_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bdickson_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_bdickson").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_bdickson_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/bdickson/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Icelandic XlmRoBertaForQuestionAnswering (from vesteinn)
author: John Snow Labs
name: xlm_roberta_qa_XLMr_ENIS_QA_Is
date: 2022-06-23
tags: [is, open_source, question_answering, xlmroberta]
task: Question Answering
language: is
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `XLMr-ENIS-QA-Is` is an Icelandic model originally trained by `vesteinn`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_XLMr_ENIS_QA_Is_is_4.0.0_3.0_1655983971257.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_XLMr_ENIS_QA_Is_is_4.0.0_3.0_1655983971257.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_XLMr_ENIS_QA_Is","is") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_XLMr_ENIS_QA_Is","is")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("is.answer_question.xlmr_roberta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_XLMr_ENIS_QA_Is|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|is|
|Size:|453.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/vesteinn/XLMr-ENIS-QA-Is
---
layout: model
title: Pipeline to Mapping ICD10-CM Codes with Their Corresponding ICD-9-CM Codes
author: John Snow Labs
name: icd10_icd9_mapping
date: 2023-06-13
tags: [en, licensed, icd10cm, icd9, pipeline, chunk_mapping]
task: Chunk Mapping
language: en
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the `icd10_icd9_mapper` model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10_icd9_mapping_en_4.4.4_3.2_1686663555396.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10_icd9_mapping_en_4.4.4_3.2_1686663555396.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("icd10_icd9_mapping", "en", "clinical/models")
result = pipeline.fullAnnotate("Z833 A0100 A000")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("icd10_icd9_mapping", "en", "clinical/models")
val result = pipeline.fullAnnotate("Z833 A0100 A000")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.icd10_icd9.mapping").predict("""Put your text here.""")
```
## Results
```bash
| | icd10_code | icd9_code |
|---:|:--------------------|:-------------------|
| 0 | Z833 | A0100 | A000 | V180 | 0020 | 0010 |
```
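Conceptually, the pipeline tokenizes the input string and looks each ICD-10-CM code up in a mapping table. A minimal pure-Python sketch seeded with the three example codes above (the dictionary is illustrative; the real relations live in the pretrained `ChunkMapperModel`):

```python
# Illustrative ICD-10-CM -> ICD-9-CM lookup, seeded with the example output above.
icd10_to_icd9 = {
    "Z833": "V180",
    "A0100": "0020",
    "A000": "0010",
}

def map_codes(text: str) -> list:
    # Tokenize on whitespace and map each recognized code.
    return [icd10_to_icd9.get(token, "NONE") for token in text.split()]

print(map_codes("Z833 A0100 A000"))  # ['V180', '0020', '0010']
```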
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|icd10_icd9_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|593.6 KB|
## Included Models
- DocumentAssembler
- TokenizerModel
- ChunkMapperModel
---
layout: model
title: Translate Albanian to English Pipeline
author: John Snow Labs
name: translate_sq_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, sq, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `sq`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_sq_en_xx_2.7.0_2.4_1609688416552.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_sq_en_xx_2.7.0_2.4_1609688416552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_sq_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_sq_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.sq.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_sq_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Detect Clinical Entities (ner_eu_clinical_case - es)
author: John Snow Labs
name: ner_eu_clinical_case
date: 2023-02-01
tags: [es, clinical, licensed, ner]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 4.2.8
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition (NER) deep learning model for extracting clinical entities from Spanish texts. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, *Named Entity Recognition with Bidirectional LSTM-CNNs*.
The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
## Predicted Entities
`clinical_event`, `bodypart`, `clinical_condition`, `units_measurements`, `patient`, `date_time`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_es_4.2.8_3.0_1675285093855.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_es_4.2.8_3.0_1675285093855.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","es")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_eu_clinical_case", "es", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["""Un niño de 3 años con trastorno autista en el hospital de la sala pediátrica A del hospital universitario. No tiene antecedentes familiares de enfermedad o trastorno del espectro autista. El niño fue diagnosticado con un trastorno de comunicación severo, con dificultades de interacción social y retraso en el procesamiento sensorial. Los análisis de sangre fueron normales (hormona estimulante de la tiroides (TSH), hemoglobina, volumen corpuscular medio (MCV) y ferritina). La endoscopia alta también mostró un tumor submucoso que causaba una obstrucción subtotal de la salida gástrica. Ante la sospecha de tumor del estroma gastrointestinal, se realizó gastrectomía distal. El examen histopatológico reveló proliferación de células fusiformes en la capa submucosa."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","es")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_eu_clinical_case", "es", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("""Un niño de 3 años con trastorno autista en el hospital de la sala pediátrica A del hospital universitario. No tiene antecedentes familiares de enfermedad o trastorno del espectro autista. El niño fue diagnosticado con un trastorno de comunicación severo, con dificultades de interacción social y retraso en el procesamiento sensorial. Los análisis de sangre fueron normales (hormona estimulante de la tiroides (TSH), hemoglobina, volumen corpuscular medio (MCV) y ferritina). La endoscopia alta también mostró un tumor submucoso que causaba una obstrucción subtotal de la salida gástrica. Ante la sospecha de tumor del estroma gastrointestinal, se realizó gastrectomía distal. El examen histopatológico reveló proliferación de células fusiformes en la capa submucosa.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+--------------------------------+------------------+
|chunk |ner_label |
+--------------------------------+------------------+
|Un niño de 3 años |patient |
|trastorno autista |clinical_event |
|antecedentes |clinical_event |
|enfermedad |clinical_event |
|trastorno del espectro autista |clinical_event |
|El niño |patient |
|diagnosticado |clinical_event |
|trastorno de comunicación severo|clinical_event |
|dificultades |clinical_event |
|retraso |clinical_event |
|análisis |clinical_event |
|sangre |bodypart |
|normales |units_measurements|
|hormona |clinical_event |
|la tiroides |bodypart |
|TSH |clinical_event |
|hemoglobina |clinical_event |
|volumen |clinical_event |
|MCV |clinical_event |
|ferritina |clinical_event |
|endoscopia |clinical_event |
|mostró |clinical_event |
|tumor submucoso |clinical_event |
|obstrucción |clinical_event |
|tumor |clinical_event |
|del estroma gastrointestinal |bodypart |
|gastrectomía |clinical_event |
|examen |clinical_event |
|reveló |clinical_event |
|proliferación |clinical_event |
|células fusiformes |bodypart |
|la capa submucosa |bodypart |
+--------------------------------+------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_eu_clinical_case|
|Compatibility:|Healthcare NLP 4.2.8+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|895.1 KB|
## References
The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
## Benchmarking
```bash
label tp fp fn total precision recall f1
date_time 87.0 10.0 17.0 104.0 0.8969 0.8365 0.8657
units_measurements 37.0 5.0 11.0 48.0 0.8810 0.7708 0.8222
clinical_condition 50.0 34.0 70.0 120.0 0.5952 0.4167 0.4902
patient 76.0 8.0 11.0 87.0 0.9048 0.8736 0.8889
clinical_event 399.0 44.0 79.0 478.0 0.9007 0.8347 0.8664
bodypart 153.0 56.0 13.0 166.0 0.7321 0.9217 0.8160
macro - - - - - - 0.7916
micro - - - - - - 0.8128
```
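Each row's precision, recall, and F1 in the table above follow directly from the tp/fp/fn counts. A quick sanity check for the `date_time` row:

```python
# Precision/recall/F1 recomputed from the date_time row of the benchmark table.
tp, fp, fn = 87.0, 10.0, 17.0

precision = tp / (tp + fp)           # 87 / 97
recall = tp / (tp + fn)              # 87 / 104
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.8969 0.8365 0.8657
```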
---
layout: model
title: Korean BertForQuestionAnswering model (from bespin-global)
author: John Snow Labs
name: bert_qa_bespin_global_klue_bert_base_mrc
date: 2022-06-02
tags: [ko, open_source, question_answering, bert]
task: Question Answering
language: ko
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `klue-bert-base-mrc` is a Korean model originally trained by `bespin-global`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bespin_global_klue_bert_base_mrc_ko_4.0.0_3.0_1654188080353.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bespin_global_klue_bert_base_mrc_ko_4.0.0_3.0_1654188080353.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bespin_global_klue_bert_base_mrc","ko") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bespin_global_klue_bert_base_mrc","ko")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("ko.answer_question.klue.bert.base.by_bespin-global").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bespin_global_klue_bert_base_mrc|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ko|
|Size:|413.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/bespin-global/klue-bert-base-mrc
- https://www.bespinglobal.com/
---
layout: model
title: Fast Neural Machine Translation Model from Salishan Languages to English
author: John Snow Labs
name: opus_mt_sal_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, sal, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `sal`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_sal_en_xx_2.7.0_2.4_1609163973743.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_sal_en_xx_2.7.0_2.4_1609163973743.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_sal_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_sal_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.sal.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_sal_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_by_testimonial TFWav2Vec2ForCTC from testimonial
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab_by_testimonial
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_testimonial` is an English model originally trained by testimonial.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_testimonial_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_testimonial_en_4.2.0_3.0_1664107711019.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_testimonial_en_4.2.0_3.0_1664107711019.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_timit_demo_colab_by_testimonial", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_timit_demo_colab_by_testimonial", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab_by_testimonial|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|354.9 MB|
---
layout: model
title: Classify Edgar Financial Filings and Schedules
author: John Snow Labs
name: finclf_sec_schedules_filings
date: 2023-01-13
tags: [sec, filings, schedules, en, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
recommended: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a `multiclass` model that analyzes the first 512 tokens of your document and predicts whether it belongs to one of the supported classes (see Predicted Entities).
The class `schedule` includes `TO-C`, `13D`, `TO-T`, `14F1`, `14D9`, `14N`, `13G`, `TO-I`, `13E3`.
`3` means SEC's `FORM-3`.
`4` means SEC's `FORM-4`.
## Predicted Entities
`schedule`, `other`, `10-K`, `10-Q`, `3`, `4`, `8-K`, `S-8`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_sec_schedules_filings_en_1.0.0_3.0_1673628989895.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_sec_schedules_filings_en_1.0.0_3.0_1673628989895.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
embeddings = nlp.UniversalSentenceEncoder.pretrained()\
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
doc_classifier = finance.ClassifierDLModel.pretrained("finclf_sec_schedules_filings", "en", "finance/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
embeddings,
doc_classifier
])
text = """SECURITIES AND EXCHANGE COMMISSION
WASHINGTON, DC 20549
SCHEDULE 13D
(Rule 13d-101)
INFORMATION TO BE INCLUDED IN STATEMENTS FILED PURSUANT TO RULE 13d-1(a)
AND AMENDMENTS THERETO FILED PURSUANT TO RULE 13d-2(a)
Under the Securities Exchange Act of 1934
(Amendment No. 2)*
TILE SHOP HOLDINGS, INC.
(Name of Issuer)
...."""
df = spark.createDataFrame([[text]]).toDF("text")
model = nlpPipeline.fit(df)
result = model.transform(df)
```
## Results
```bash
['schedule']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finclf_sec_schedules_filings|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.7 MB|
## References
SEC's Edgar
## Benchmarking
```bash
label precision recall f1-score support
10-K 0.93 0.90 0.92 42
10-Q 0.95 0.95 0.95 38
3 0.62 0.61 0.62 33
4 0.82 0.78 0.80 54
8-K 0.86 0.91 0.88 33
S-8 0.93 0.96 0.95 28
other 1.00 1.00 1.00 238
schedule 0.94 0.96 0.95 50
accuracy - - 0.93 516
macro-avg 0.88 0.88 0.88 516
weighted-avg 0.93 0.93 0.93 516
```
---
layout: model
title: RxNorm Xsmall ChunkResolver
author: John Snow Labs
name: chunkresolve_rxnorm_xsmall_clinical
class: ChunkEntityResolverModel
language: en
nav_key: models
repository: clinical/models
date: 2020-06-24
task: Entity Resolution
edition: Healthcare NLP 2.5.2
spark_version: 2.4
tags: [clinical,licensed,entity_resolution,en]
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Entity resolution model based on k-nearest neighbors (KNN) over word embeddings, using Word Mover's Distance.
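The resolution strategy can be sketched as nearest-neighbor search over embedding vectors. A toy pure-Python illustration with cosine similarity standing in for the full distance mix (the codes are taken from the example output below, but the vectors are invented for the example; the real model uses clinical word embeddings and Word Mover's Distance):

```python
import math

# Hypothetical embedding table: RxNorm code -> vector (values invented for illustration).
candidates = {
    "310488": [0.9, 0.1, 0.0],   # Glipizide
    "861731": [0.6, 0.7, 0.1],   # Glipizide / Metformin hydrochloride
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def resolve(chunk_vector, k=1):
    # Rank candidate codes by similarity to the chunk embedding and keep the top k.
    ranked = sorted(candidates, key=lambda code: cosine(chunk_vector, candidates[code]),
                    reverse=True)
    return ranked[:k]

print(resolve([0.88, 0.15, 0.02]))  # nearest code for a chunk resembling "glipizide"
```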
{:.h2_title}
## Predicted Entities
RxNorm Codes and their normalized definition with `clinical_embeddings`.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_xsmall_clinical_en_2.5.2_2.4_1592959394598.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_xsmall_clinical_en_2.5.2_2.4_1592959394598.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPython.html %}
```python
...
rxnorm_resolver = ChunkEntityResolverModel()\
.pretrained('chunkresolve_rxnorm_xsmall_clinical', 'en', "clinical/models")\
.setEnableLevenshtein(True)\
.setNeighbours(200).setAlternatives(5).setDistanceWeights([3,11,0,0,0,9])\
.setInputCols(['token', 'chunk_embeddings'])\
.setOutputCol('rxnorm_resolution')\
.setPoolingStrategy("MAX")
pipeline_rxnorm = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, rxnorm_resolver])
data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation."""]]).toDF("text")
model = pipeline_rxnorm.fit(data)
results = model.transform(data)
```
```scala
...
val rxnorm_resolver = ChunkEntityResolverModel
.pretrained("chunkresolve_rxnorm_xsmall_clinical", "en", "clinical/models")
.setEnableLevenshtein(true)
.setNeighbours(200).setAlternatives(5).setDistanceWeights(Array(3,11,0,0,0,9))
.setInputCols(Array("token", "chunk_embeddings"))
.setOutputCol("rxnorm_resolution")
.setPoolingStrategy("MAX")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, rxnorm_resolver))
val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
```bash
+---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
| chunk| entity| target_text| code|confidence|
+---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
| metformin|TREATMENT|Glipizide Metformin hydrochloride:::Glyburide Metformin hydrochloride:::Glipizide Metformin hydro...| 861731| 0.2000|
| glipizide|TREATMENT| Glipizide:::Glipizide:::Glipizide:::Glipizide:::Glipizide Metformin hydrochloride| 310488| 0.2499|
|dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG|TREATMENT|dapagliflozin saxagliptin:::dapagliflozin saxagliptin:::dapagliflozin saxagliptin:::dapagliflozin...|1925504| 0.2080|
| dapagliflozin|TREATMENT| dapagliflozin:::dapagliflozin:::dapagliflozin:::dapagliflozin:::dapagliflozin saxagliptin|1488574| 0.2492|
+---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
```
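In the table above, the `target_text` column packs the top resolution candidates into one string separated by `:::`, ordered best match first. As a minimal sketch (the helper below is hypothetical, not part of Spark NLP), the alternatives can be unpacked like this:

```python
# Hypothetical helper (not part of Spark NLP): split the ":::"-delimited
# alternatives shown in the `target_text` column of the results above.
def split_alternatives(target_text, limit=None):
    """Return the list of candidate resolutions, best match first."""
    alts = [a.strip() for a in target_text.split(":::")]
    return alts[:limit] if limit else alts

alts = split_alternatives(
    "dapagliflozin:::dapagliflozin:::dapagliflozin saxagliptin", limit=2
)
print(alts)  # ['dapagliflozin', 'dapagliflozin']
```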
{:.model-param}
## Model Information
{:.table-model}
|----------------|-------------------------------------|
| Name: | chunkresolve_rxnorm_xsmall_clinical |
| Type: | ChunkEntityResolverModel |
| Compatibility: | Spark NLP 2.5.2+ |
| License: | Licensed |
| Edition: | Official |
|Input labels: | [token, chunk_embeddings] |
|Output labels: | [entity] |
| Language: | en |
| Case sensitive: | True |
| Dependencies: | embeddings_clinical |
{:.h2_title}
## Data Source
Trained on December 2019 RxNorm Subset
https://www.nlm.nih.gov/research/umls/rxnorm/
---
layout: model
title: Finnish asr_wav2vec2_large_xlsr_53_finnish_by_Tommi TFWav2Vec2ForCTC from Tommi
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_53_finnish_by_Tommi
date: 2022-09-24
tags: [wav2vec2, fi, audio, open_source, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_finnish_by_Tommi` is a Finnish model originally trained by Tommi.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_finnish_by_Tommi_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_finnish_by_Tommi_fi_4.2.0_3.0_1664020930598.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_finnish_by_Tommi_fi_4.2.0_3.0_1664020930598.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xlsr_53_finnish_by_Tommi", "fi")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xlsr_53_finnish_by_Tommi", "fi")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xlsr_53_finnish_by_Tommi|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|fi|
|Size:|1.2 GB|
---
layout: model
title: Pipeline to Detect Medication Entities, Assign Assertion Status and Find Relations
author: John Snow Labs
name: explain_clinical_doc_medication
date: 2022-04-01
tags: [licensed, en, clinical, ner, assertion, relation_extraction, posology]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 3.4.2
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A pipeline for detecting posology entities with the `ner_posology_large` NER model, assigning their assertion status with the `assertion_jsl` model, and extracting relations between posology-related terminology with the `posology_re` relation extraction model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_medication_en_3.4.2_3.0_1648813363898.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_medication_en_3.4.2_3.0_1648813363898.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("explain_clinical_doc_medication", "en", "clinical/models")
result = pipeline.fullAnnotate("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2. She received a course of Bactrim for 14 days for UTI. She was prescribed 5000 units of Fragmin subcutaneously daily, and along with Lantus 40 units subcutaneously at bedtime.""")[0]
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("explain_clinical_doc_medication", "en", "clinical/models")
val result = pipeline.fullAnnotate("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2. She received a course of Bactrim for 14 days for UTI. She was prescribed 5000 units of Fragmin subcutaneously daily, and along with Lantus 40 units subcutaneously at bedtime.""")(0)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.explain_dco.clinical_medication.pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2. She received a course of Bactrim for 14 days for UTI. She was prescribed 5000 units of Fragmin subcutaneously daily, and along with Lantus 40 units subcutaneously at bedtime.""")
```
## Results
```bash
+----+----------------+------------+
| | chunks | entities |
|---:|:---------------|:-----------|
| 0 | insulin | DRUG |
| 1 | Bactrim | DRUG |
| 2 | for 14 days | DURATION |
| 3 | 5000 units | DOSAGE |
| 4 | Fragmin | DRUG |
| 5 | subcutaneously | ROUTE |
| 6 | daily | FREQUENCY |
| 7 | Lantus | DRUG |
| 8 | 40 units | DOSAGE |
| 9 | subcutaneously | ROUTE |
| 10 | at bedtime | FREQUENCY |
+----+----------------+------------+
+----+----------+------------+-------------+
| | chunks | entities | assertion |
|---:|:---------|:-----------|:------------|
| 0 | insulin | DRUG | Present |
| 1 | Bactrim | DRUG | Past |
| 2 | Fragmin | DRUG | Planned |
| 3 | Lantus | DRUG | Planned |
+----+----------+------------+-------------+
+----------------+-----------+------------+-----------+----------------+
| relation | entity1 | chunk1 | entity2 | chunk2 |
|:---------------|:----------|:-----------|:----------|:---------------|
| DRUG-DURATION | DRUG | Bactrim | DURATION | for 14 days |
| DOSAGE-DRUG | DOSAGE | 5000 units | DRUG | Fragmin |
| DRUG-ROUTE | DRUG | Fragmin | ROUTE | subcutaneously |
| DRUG-FREQUENCY | DRUG | Fragmin | FREQUENCY | daily |
| DRUG-DOSAGE | DRUG | Lantus | DOSAGE | 40 units |
| DRUG-ROUTE | DRUG | Lantus | ROUTE | subcutaneously |
| DRUG-FREQUENCY | DRUG | Lantus | FREQUENCY | at bedtime |
+----------------+-----------+------------+-----------+----------------+
```
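The three tables above are flattened views of different output columns returned by `fullAnnotate`. As a rough sketch (the `result`/`metadata` field names below mirror the table headers but are assumptions to verify against your actual pipeline output), relation annotations can be tabulated like this:

```python
# Sketch only: assumes each relation annotation exposes `result` (the relation
# label) and a metadata dict with entity/chunk fields, as in the tables above.
def relation_rows(relations):
    """Flatten relation annotations into (relation, entity1, chunk1, entity2, chunk2) rows."""
    rows = []
    for rel in relations:
        m = rel["metadata"]
        rows.append((rel["result"], m["entity1"], m["chunk1"], m["entity2"], m["chunk2"]))
    return rows

demo = [{"result": "DRUG-DURATION",
         "metadata": {"entity1": "DRUG", "chunk1": "Bactrim",
                      "entity2": "DURATION", "chunk2": "for 14 days"}}]
print(relation_rows(demo))  # [('DRUG-DURATION', 'DRUG', 'Bactrim', 'DURATION', 'for 14 days')]
```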
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_clinical_doc_medication|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternal
- NerConverterInternal
- AssertionDLModel
- PerceptronModel
- DependencyParserModel
- PosologyREModel
---
layout: model
title: Pipeline to Detect Clinical Entities (WIP Greedy)
author: John Snow Labs
name: jsl_ner_wip_greedy_clinical_pipeline
date: 2022-03-21
tags: [licensed, ner, wip, clinical, greedy, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [jsl_ner_wip_greedy_clinical](https://nlp.johnsnowlabs.com/2021/03/31/jsl_ner_wip_greedy_clinical_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_clinical_pipeline_en_3.4.1_3.0_1647866343183.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_clinical_pipeline_en_3.4.1_3.0_1647866343183.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("jsl_ner_wip_greedy_clinical_pipeline", "en", "clinical/models")
pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.")
```
```scala
val pipeline = new PretrainedPipeline("jsl_ner_wip_greedy_clinical_pipeline", "en", "clinical/models")
pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.wip_greedy_clinical.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
## Results
```bash
+----------------------------------------------+----------------------------+
|chunk |ner_label |
+----------------------------------------------+----------------------------+
|21-day-old |Age |
|Caucasian |Race_Ethnicity |
|male |Gender |
|for 2 days |Duration |
|congestion |Symptom |
|mom |Gender |
|suctioning yellow discharge |Symptom |
|nares |External_body_part_or_region|
|she |Gender |
|mild problems with his breathing while feeding|Symptom |
|perioral cyanosis |Symptom |
|retractions |Symptom |
|One day ago |RelativeDate |
|mom |Gender |
|tactile temperature |Symptom |
|Tylenol |Drug |
|Baby |Age |
|decreased p.o. intake |Symptom |
|His |Gender |
|20 minutes |Duration |
+----------------------------------------------+----------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|jsl_ner_wip_greedy_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011)
author: John Snow Labs
name: distilbert_token_classifier_autotrain_company_all_903429540
date: 2023-03-14
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-company_all-903429540` is an English model originally trained by `ismail-lucifer011`.
## Predicted Entities
`Company`, `OOV`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_all_903429540_en_4.3.1_3.0_1678783374440.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_all_903429540_en_4.3.1_3.0_1678783374440.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_all_903429540","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_all_903429540","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_autotrain_company_all_903429540|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ismail-lucifer011/autotrain-company_all-903429540
---
layout: model
title: English asr_distil_wav2vec2 TFWav2Vec2ForCTC from OthmaneJ
author: John Snow Labs
name: pipeline_asr_distil_wav2vec2
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_distil_wav2vec2` is an English model originally trained by OthmaneJ.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_distil_wav2vec2_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_distil_wav2vec2_en_4.2.0_3.0_1664020989973.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_distil_wav2vec2_en_4.2.0_3.0_1664020989973.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_distil_wav2vec2', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_distil_wav2vec2", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_distil_wav2vec2|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|188.9 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Pipeline to Classify Texts into 4 News Categories
author: John Snow Labs
name: bert_sequence_classifier_age_news_pipeline
date: 2022-06-19
tags: [ag_news, news, bert, bert_sequence, classification, en, open_source]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of [bert_sequence_classifier_age_news_en](https://nlp.johnsnowlabs.com/2021/11/07/bert_sequence_classifier_age_news_en.html), which is imported from `HuggingFace`.
## Predicted Entities
`World`, `Sports`, `Business`, `Sci/Tech`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_age_news_pipeline_en_4.0.0_3.0_1655653779437.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_age_news_pipeline_en_4.0.0_3.0_1655653779437.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
news_pipeline = PretrainedPipeline("bert_sequence_classifier_age_news_pipeline", lang = "en")
news_pipeline.annotate("Microsoft has taken its first step into the metaverse.")
```
```scala
val news_pipeline = new PretrainedPipeline("bert_sequence_classifier_age_news_pipeline", lang = "en")
news_pipeline.annotate("Microsoft has taken its first step into the metaverse.")
```
## Results
```bash
['Sci/Tech']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_age_news_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|42.4 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- BertForSequenceClassification
---
layout: model
title: Legal NER for NDA (Return of Confidential Information Clauses)
author: John Snow Labs
name: legner_nda_return_of_conf_info
date: 2023-04-19
tags: [en, legal, licensed, ner, nda]
task: Named Entity Recognition
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is an NER model, aimed to be run **only** after detecting the `RETURN_OF_CONF_INFO` clause with a proper classifier (use the `legmulticlf_mnda_sections_paragraph_other` model for that purpose). It will extract the following entities: `ARCHIVAL_PURPOSE` and `LEGAL_PURPOSE`.
## Predicted Entities
`ARCHIVAL_PURPOSE`, `LEGAL_PURPOSE`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_return_of_conf_info_en_1.0.0_3.0_1681936414470.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_return_of_conf_info_en_1.0.0_3.0_1681936414470.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)\
.setCaseSensitive(True)
ner_model = legal.NerModel.pretrained("legner_nda_return_of_conf_info", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter
])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""Notwithstanding the foregoing, the Recipient and its Representatives may retain copies of the Confidential Information to the extent that such retention is required to demonstrate compliance with applicable law or governmental rule or regulation, to the extent included in any board or executive documents relating to the proposed business relationship, and in its archives for backup purposes subject to the confidentiality provisions of this Agreement."""]
result = model.transform(spark.createDataFrame([text]).toDF("text"))
```
## Results
```bash
+--------------+----------------+
|chunk |ner_label |
+--------------+----------------+
|applicable law|LEGAL_PURPOSE |
|governmental |LEGAL_PURPOSE |
|regulation |LEGAL_PURPOSE |
|archives |ARCHIVAL_PURPOSE|
|backup |ARCHIVAL_PURPOSE|
+--------------+----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_nda_return_of_conf_info|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|16.3 MB|
## References
In-house annotations on Non-disclosure Agreements
## Benchmarking
```bash
label precision recall f1-score support
ARCHIVAL_PURPOSE 0.94 1.00 0.97 16
LEGAL_PURPOSE 0.78 0.85 0.81 33
micro-avg 0.83 0.90 0.86 49
macro-avg 0.86 0.92 0.89 49
weighted-avg 0.83 0.90 0.86 49
```
---
layout: model
title: English asr_xlsr_wav2vec_english TFWav2Vec2ForCTC from harshit345
author: John Snow Labs
name: asr_xlsr_wav2vec_english
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_wav2vec_english` is an English model originally trained by harshit345.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_xlsr_wav2vec_english_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xlsr_wav2vec_english_en_4.2.0_3.0_1664043295043.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xlsr_wav2vec_english_en_4.2.0_3.0_1664043295043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_xlsr_wav2vec_english", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_xlsr_wav2vec_english", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_xlsr_wav2vec_english|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from AurasuiteAgreements)
author: John Snow Labs
name: bert_qa_base_uncased_contracts_finetuned_on_squadv2
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-contracts-finetuned-on-squadv2` is an English model originally trained by `AurasuiteAgreements`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_contracts_finetuned_on_squadv2_en_4.0.0_3.0_1657183854663.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_contracts_finetuned_on_squadv2_en_4.0.0_3.0_1657183854663.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_contracts_finetuned_on_squadv2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_contracts_finetuned_on_squadv2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_contracts_finetuned_on_squadv2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/AurasuiteAgreements/bert-base-uncased-contracts-finetuned-on-squadv2
---
layout: model
title: Turkish BertForTokenClassification Cased model (from busecarik)
author: John Snow Labs
name: bert_token_classifier_berturk_sunlp_ner_turkish
date: 2022-11-30
tags: [tr, open_source, bert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: tr
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `berturk-sunlp-ner-turkish` is a Turkish model originally trained by `busecarik`.
## Predicted Entities
`PRODUCT`, `TIME`, `MONEY`, `ORGANIZATION`, `LOCATION`, `TVSHOW`, `PERSON`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_berturk_sunlp_ner_turkish_tr_4.2.4_3.0_1669815581712.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_berturk_sunlp_ner_turkish_tr_4.2.4_3.0_1669815581712.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_berturk_sunlp_ner_turkish","tr") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_berturk_sunlp_ner_turkish","tr")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_berturk_sunlp_ner_turkish|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|tr|
|Size:|689.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/busecarik/berturk-sunlp-ner-turkish
- https://github.com/SU-NLP/SUNLP-Twitter-NER-Dataset
---
layout: model
title: Legal Conditions Clause Binary Classifier
author: John Snow Labs
name: legclf_conditions_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `conditions` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting them using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing a True/False output for each clause model you add.
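As a minimal, standard-library-only illustration of the first splitting technique above (the function name is ours for illustration, not part of Spark NLP or the workshop notebook), paragraphs can be cut on blank lines:

```python
import re

def split_paragraphs(text: str) -> list[str]:
    # Split on one or more blank lines (the "multiline" paragraph breaks)
    # and drop whitespace-only fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "First clause paragraph.\n\nSecond clause paragraph.\n\n\nThird one."
print(split_paragraphs(doc))
# → ['First clause paragraph.', 'Second clause paragraph.', 'Third one.']
```

Each resulting piece can then be fed to the classifier separately, keeping every input under the 512-token limit.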
## Predicted Entities
`other`, `conditions`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_conditions_clause_en_1.0.0_3.2_1660123336869.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_conditions_clause_en_1.0.0_3.2_1660123336869.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
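This card does not ship a code snippet. The sketch below mirrors the pipeline used by comparable Legal NLP clause classifiers and is an assumption, not this card's official example: the `nlp`/`legal` module aliases and the choice of Universal Sentence Encoder embeddings (matching the model's `sentence_embeddings` input label) are assumed.

```python
# Assumed pipeline sketch: document -> sentence embeddings -> clause classifier.
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
embeddings = nlp.UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")
docClassifier = legal.ClassifierDLModel.pretrained("legclf_conditions_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")
pipeline = nlp.Pipeline(stages=[documentAssembler, embeddings, docClassifier])
data = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```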
## Results
```bash
+------------+
|      result|
+------------+
|[conditions]|
|     [other]|
|     [other]|
|[conditions]|
+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_conditions_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
conditions 0.89 0.78 0.83 82
other 0.90 0.95 0.93 175
accuracy - - 0.90 257
macro-avg 0.90 0.87 0.88 257
weighted-avg 0.90 0.90 0.90 257
```
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from ksabeh)
author: John Snow Labs
name: distilbert_qa_attribute_correction_mlm_titles
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-attribute-correction-mlm-titles` is an English model originally trained by `ksabeh`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_attribute_correction_mlm_titles_en_4.3.0_3.0_1672766395443.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_attribute_correction_mlm_titles_en_4.3.0_3.0_1672766395443.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_attribute_correction_mlm_titles","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_attribute_correction_mlm_titles","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_attribute_correction_mlm_titles|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ksabeh/distilbert-attribute-correction-mlm-titles
---
layout: model
title: Legal Disclaimer Clause Binary Classifier
author: John Snow Labs
name: legclf_disclaimer_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `disclaimer` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting them using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing a True/False output for each clause model you add.
## Predicted Entities
`other`, `disclaimer`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_disclaimer_clause_en_1.0.0_3.2_1660123418981.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_disclaimer_clause_en_1.0.0_3.2_1660123418981.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
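This card does not ship a code snippet. The sketch below mirrors the pipeline used by comparable Legal NLP clause classifiers and is an assumption, not this card's official example: the `nlp`/`legal` module aliases and the choice of Universal Sentence Encoder embeddings (matching the model's `sentence_embeddings` input label) are assumed.

```python
# Assumed pipeline sketch: document -> sentence embeddings -> clause classifier.
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
embeddings = nlp.UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")
docClassifier = legal.ClassifierDLModel.pretrained("legclf_disclaimer_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")
pipeline = nlp.Pipeline(stages=[documentAssembler, embeddings, docClassifier])
data = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```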
## Results
```bash
+------------+
|      result|
+------------+
|[disclaimer]|
|     [other]|
|     [other]|
|[disclaimer]|
+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_disclaimer_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
disclaimer 0.94 0.89 0.91 35
other 0.93 0.96 0.95 55
accuracy - - 0.93 90
macro-avg 0.93 0.92 0.93 90
weighted-avg 0.93 0.93 0.93 90
```
---
layout: model
title: Hebrew Lemmatizer
author: John Snow Labs
name: lemma
date: 2020-12-09
task: Lemmatization
language: he
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [lemmatizer, he, open_source]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_he_2.7.0_2.4_1607522684355.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_he_2.7.0_2.4_1607522684355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of a pipeline after tokenisation.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
lemmatizer = LemmatizerModel.pretrained("lemma", "he") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
results = light_pipeline.fullAnnotate(["""להגיש הגישה הגיש הגשתי יגיש מגישים הגישו תגיש הגשנו מגישה"""])
```
```scala
...
val lemmatizer = LemmatizerModel.pretrained("lemma", "he")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer))
val data = Seq("להגיש הגישה הגיש הגשתי יגיש מגישים הגישו תגיש הגשנו מגישה").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""להגיש הגישה הגיש הגשתי יגיש מגישים הגישו תגיש הגשנו מגישה"""]
lemma_df = nlu.load('he.lemma').predict(text, output_level='document')
lemma_df.lemma.values[0]
```
## Results
```bash
{'lemma': [Annotation(token, 0, 4, הגיש, {'sentence': '0'}),
Annotation(token, 6, 10, הגיש, {'sentence': '0'}),
Annotation(token, 12, 15, הגיש, {'sentence': '0'}),
Annotation(token, 17, 21, הגיש, {'sentence': '0'}),
Annotation(token, 23, 26, הגיש, {'sentence': '0'}),
Annotation(token, 28, 33, הגיש, {'sentence': '0'}),
Annotation(token, 35, 39, הגיש, {'sentence': '0'}),
Annotation(token, 41, 44, הגיש, {'sentence': '0'}),
Annotation(token, 46, 50, הגיש, {'sentence': '0'}),
Annotation(token, 52, 56, הגיש, {'sentence': '0'})]}
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[tokens]|
|Output Labels:|[lemma]|
|Language:|he|
## Data Source
This model is trained on data obtained from [https://universaldependencies.org/](https://universaldependencies.org/)
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from Moussab)
author: John Snow Labs
name: roberta_qa_deepset_base_squad2_orkg_how_1e_4
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deepset-roberta-base-squad2-orkg-how-1e-4` is an English model originally trained by `Moussab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_how_1e_4_en_4.3.0_3.0_1674209475140.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_how_1e_4_en_4.3.0_3.0_1674209475140.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_how_1e_4","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_how_1e_4","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_deepset_base_squad2_orkg_how_1e_4|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Moussab/deepset-roberta-base-squad2-orkg-how-1e-4
---
layout: model
title: English T5ForConditionalGeneration Cased model (from rajistics)
author: John Snow Labs
name: t5_informal_formal_style_transfer
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `informal_formal_style_transfer` is an English model originally trained by `rajistics`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_informal_formal_style_transfer_en_4.3.0_3.0_1675103071459.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_informal_formal_style_transfer_en_4.3.0_3.0_1675103071459.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_informal_formal_style_transfer","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_informal_formal_style_transfer","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_informal_formal_style_transfer|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|924.5 MB|
## References
- https://huggingface.co/rajistics/informal_formal_style_transfer
- https://github.com/PrithivirajDamodaran/Styleformer
- https://www.aclweb.org/anthology/D19-5502.pdf
- http://cs230.stanford.edu/projects_winter_2020/reports/32069807.pdf
- https://arxiv.org/pdf/1804.06437.pdf
---
layout: model
title: Spanish Named Entity Recognition (from mrm8488)
author: John Snow Labs
name: bert_ner_TinyBERT_spanish_uncased_finetuned_ner
date: 2022-05-09
tags: [bert, ner, token_classification, es, open_source]
task: Named Entity Recognition
language: es
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `TinyBERT-spanish-uncased-finetuned-ner` is a Spanish model originally trained by `mrm8488`.
## Predicted Entities
`LOC`, `PER`, `ORG`, `MISC`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_TinyBERT_spanish_uncased_finetuned_ner_es_3.4.2_3.0_1652096474583.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_TinyBERT_spanish_uncased_finetuned_ner_es_3.4.2_3.0_1652096474583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_TinyBERT_spanish_uncased_finetuned_ner","es") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Amo Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_TinyBERT_spanish_uncased_finetuned_ner","es")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Amo Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_TinyBERT_spanish_uncased_finetuned_ner|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|54.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/mrm8488/TinyBERT-spanish-uncased-finetuned-ner
- https://www.kaggle.com/nltkdata/conll-corpora
- https://www.kaggle.com/nltkdata/conll-corpora
- https://twitter.com/mrm8488
---
layout: model
title: Financial News Multilabel Classifier
author: John Snow Labs
name: finmulticlf_news
date: 2022-08-30
tags: [en, finance, classification, news, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: MultiClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a multilabel classification model trained on financial news scraped from the Internet, together with in-house annotations and label grouping. As the model is multilabel, for each news item the output is an array of 0 (no classes detected), 1 (one class), or N (n classes detected) labels.
The available classes are:
- acq: Acquisition / Purchase operations
- finance: Generic financial news
- fuel: News about fuel and energy sources
- jobs: News about jobs, employment rates, etc.
- livestock: News about animals and livestock
- mineral: News about minerals such as copper, gold, silver, coal, etc.
- plant: News about greens, plants, cereals, etc.
- trade: Trading news
## Predicted Entities
`acq`, `finance`, `fuel`, `jobs`, `livestock`, `mineral`, `plant`, `trade`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFICATION_MULTILABEL/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmulticlf_news_en_1.0.0_3.2_1661857631377.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmulticlf_news_en_1.0.0_3.2_1661857631377.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document") \
.setCleanupMode("shrink")
embeddings = nlp.UniversalSentenceEncoder.pretrained() \
.setInputCols("document") \
.setOutputCol("embeddings")
docClassifier = nlp.MultiClassifierDLModel.pretrained("finmulticlf_news", "en","finance/models")\
.setInputCols("embeddings") \
.setOutputCol("category")
pipeline = nlp.Pipeline() \
.setStages(
[
documentAssembler,
embeddings,
docClassifier
]
)
empty_data = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(empty_data)
text = ["""
ECUADOR HAS TRADE SURPLUS IN FIRST FOUR MONTHS Ecuador posted a trade surplus of 10.6 mln dlrs in the first four months of 1987 compared with a surplus of 271.7 mln in the same period in 1986, the central bank of Ecuador said in its latest monthly report. Ecuador suspended sales of crude oil, its principal export product, in March after an earthquake destroyed part of its oil-producing infrastructure. Exports in the first four months of 1987 were around 639 mln dlrs and imports 628.3 mln, compared with 771 mln and 500 mln respectively in the same period last year. Exports of crude and products in the first four months were around 256.1 mln dlrs, compared with 403.3 mln in the same period in 1986. The central bank said that between January and May Ecuador sold 16.1 mln barrels of crude and 2.3 mln barrels of products, compared with 32 mln and 2.7 mln respectively in the same period last year. Ecuador's international reserves at the end of May were around 120.9 mln dlrs, compared with 118.6 mln at the end of April and 141.3 mln at the end of May 1986, the central bank said. gold reserves were 165.7 mln dlrs at the end of May compared with 124.3 mln at the end of April.
"""]
lmodel = LightPipeline(pipelineModel)
results = lmodel.annotate(text)
```
## Results
```bash
['finance', 'trade']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finmulticlf_news|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|12.9 MB|
## References
News scraped from the Internet and manual in-house annotations
## Benchmarking
```bash
label precision recall f1-score support
acq 0.94 0.92 0.93 718
finance 0.95 0.96 0.96 1499
fuel 0.91 0.86 0.88 286
jobs 0.86 0.57 0.69 21
livestock 0.93 0.44 0.60 57
mineral 0.87 0.62 0.72 121
plant 0.89 0.88 0.89 301
trade 0.79 0.72 0.75 113
micro-avg 0.93 0.90 0.92 3116
macro-avg 0.89 0.75 0.80 3116
weighted-avg 0.93 0.90 0.91 3116
samples-avg 0.91 0.91 0.91 3116
```
---
layout: model
title: English image_classifier_vit_vc_bantai__withoutAMBI_adunest ViTForImageClassification from AykeeSalazar
author: John Snow Labs
name: image_classifier_vit_vc_bantai__withoutAMBI_adunest
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_vc_bantai__withoutAMBI_adunest` is an English model originally trained by AykeeSalazar.
## Predicted Entities
`nonViolation`, `publicDrinking`, `publicSmoking`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vc_bantai__withoutAMBI_adunest_en_4.1.0_3.0_1660166079256.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vc_bantai__withoutAMBI_adunest_en_4.1.0_3.0_1660166079256.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_vc_bantai__withoutAMBI_adunest", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_vc_bantai__withoutAMBI_adunest", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_vc_bantai__withoutAMBI_adunest|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: English image_classifier_vit_apes ViTForImageClassification from ducnapa
author: John Snow Labs
name: image_classifier_vit_apes
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_apes` is an English model originally trained by ducnapa.
## Predicted Entities
`chimpanzee`, `gibbon`, `gorilla`, `orangutan`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_apes_en_4.1.0_3.0_1660172348568.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_apes_en_4.1.0_3.0_1660172348568.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_apes", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_apes", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_apes|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Stopwords Remover for Amharic language (228 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, am, open_source]
task: Stop Words Removal
language: am
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_am_3.4.1_3.0_1646673268681.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_am_3.4.1_3.0_1646673268681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","am") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["አንተ አልተሻልክም።"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","am")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("አንተ አልተሻልክም።").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("am.stopwords").predict("""አንተ አልተሻልክም።""")
```
## Results
```bash
+----------+
|result |
+----------+
|[አልተሻልክም።]|
+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|am|
|Size:|2.5 KB|
---
layout: model
title: TREC(50) Question Classifier
author: John Snow Labs
name: classifierdl_use_trec50
class: ClassifierDLModel
language: en
nav_key: models
repository: public/models
date: 03/05/2020
task: Text Classification
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [classifier]
supported: true
annotator: ClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Classify open-domain, fact-based questions into sub categories of the following broad semantic categories: Abbreviation, Description, Entities, Human Beings, Locations or Numeric Values.
{:.h2_title}
## Predicted Entities
``ENTY_animal``, ``ENTY_body``, ``ENTY_color``, ``ENTY_cremat``, ``ENTY_currency``, ``ENTY_dismed``, ``ENTY_event``, ``ENTY_food``, ``ENTY_instru``, ``ENTY_lang``, ``ENTY_letter``, ``ENTY_other``, ``ENTY_plant``, ``ENTY_product``, ``ENTY_religion``, ``ENTY_sport``, ``ENTY_substance``, ``ENTY_symbol``, ``ENTY_techmeth``, ``ENTY_termeq``, ``ENTY_veh``, ``ENTY_word``, ``DESC_def``, ``DESC_desc``, ``DESC_manner``, ``DESC_reason``, ``HUM_gr``, ``HUM_ind``, ``HUM_title``, ``HUM_desc``, ``LOC_city``, ``LOC_country``, ``LOC_mount``, ``LOC_other``, ``LOC_state``, ``NUM_code``, ``NUM_count``, ``NUM_date``, ``NUM_dist``, ``NUM_money``, ``NUM_ord``, ``NUM_other``, ``NUM_period``, ``NUM_perc``, ``NUM_speed``, ``NUM_temp``, ``NUM_volsize``, ``NUM_weight``, ``ABBR_abb``, ``ABBR_exp``.
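Each of the 50 labels encodes a coarse category and a fine subcategory separated by an underscore (e.g. `NUM_date` is the `date` subtype of Numeric Values). A small helper for splitting a predicted label into those two parts, useful for grouping results downstream (a sketch, not part of the Spark NLP API):

```python
def split_trec50_label(label):
    """Split a TREC-50 label such as 'NUM_date' into (coarse, fine)."""
    coarse, _, fine = label.partition("_")
    return coarse, fine

print(split_trec50_label("NUM_date"))  # ('NUM', 'date')
print(split_trec50_label("ABBR_exp"))  # ('ABBR', 'exp')
```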
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_EN_TREC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_EN_TREC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec50_en_2.5.0_2.4_1588493558481.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec50_en_2.5.0_2.4_1588493558481.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
use = UniversalSentenceEncoder.pretrained(lang="en") \
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
document_classifier = ClassifierDLModel.pretrained('classifierdl_use_trec50', 'en') \
.setInputCols(["document", "sentence_embeddings"]) \
.setOutputCol("class")
nlpPipeline = Pipeline(stages=[documentAssembler, use, document_classifier])
light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate('When did the construction of stone circles begin in the UK?')
```
```scala
val documentAssembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val use = UniversalSentenceEncoder.pretrained(lang="en")
.setInputCols(Array("document"))
.setOutputCol("sentence_embeddings")
val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_trec50", "en")
.setInputCols(Array("document", "sentence_embeddings"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier))
val data = Seq("When did the construction of stone circles begin in the UK?").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""]
trec50_df = nlu.load('en.classify.trec50.use').predict(text, output_level = "document")
trec50_df[["document", "trec50"]]
```
{:.h2_title}
## Results
{:.table-model}
```bash
+------------------------------------------------------------------------------------------------+------------+
|document |class |
+------------------------------------------------------------------------------------------------+------------+
|When did the construction of stone circles begin in the UK? | NUM_date |
+------------------------------------------------------------------------------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
| Model Name | classifierdl_use_trec50 |
| Model Class | ClassifierDLModel |
| Spark Compatibility | 2.5.0 |
| Spark NLP Compatibility | 2.4 |
| License | open source|
| Edition | public |
| Input Labels | [document, sentence_embeddings] |
| Output Labels | [class] |
| Language | en|
| Upstream Dependencies | with tfhub_use |
{:.h2_title}
## Data Source
This model is trained on the 50-class version of the TREC dataset: [http://search.r-project.org/library/textdata/html/dataset_trec.html](http://search.r-project.org/library/textdata/html/dataset_trec.html)
---
layout: model
title: Malay T5ForConditionalGeneration Base Cased model (from mesolitica)
author: John Snow Labs
name: t5_finetune_paraphrase_base_standard_bahasa_cased
date: 2023-01-30
tags: [ms, open_source, t5, tensorflow]
task: Text Generation
language: ms
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finetune-paraphrase-t5-base-standard-bahasa-cased` is a Malay model originally trained by `mesolitica`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_finetune_paraphrase_base_standard_bahasa_cased_ms_4.3.0_3.0_1675102005289.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_finetune_paraphrase_base_standard_bahasa_cased_ms_4.3.0_3.0_1675102005289.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_finetune_paraphrase_base_standard_bahasa_cased","ms") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_finetune_paraphrase_base_standard_bahasa_cased","ms")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_finetune_paraphrase_base_standard_bahasa_cased|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|ms|
|Size:|926.8 MB|
## References
- https://huggingface.co/mesolitica/finetune-paraphrase-t5-base-standard-bahasa-cased
- https://github.com/huseinzol05/malaya/tree/master/session/paraphrase/hf-t5
---
layout: model
title: T5 Clinical Summarization / QA model
author: John Snow Labs
name: t5_base_mediqa_mnli
date: 2021-02-19
tags: [t5, licensed, clinical, en]
supported: true
recommended: true
task: Summarization
language: en
nav_key: models
edition: Healthcare NLP 2.7.4
spark_version: 2.4
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The T5 transformer model described in the seminal paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" can perform a variety of tasks, such as text summarization, question answering and translation. More details about using the model can be found in the [paper](https://arxiv.org/pdf/1910.10683.pdf). This model is specifically trained on medical data for text summarization and question answering.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/t5_base_mediqa_mnli_en_2.7.4_2.4_1613750257481.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/t5_base_mediqa_mnli_en_2.7.4_2.4_1613750257481.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("documents")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols("documents")\
.setOutputCol("sentence")
t5 = T5Transformer.pretrained("t5_base_mediqa_mnli", "en", "clinical/models") \
.setInputCols(["sentence"]) \
.setOutputCol("t5_output")\
.setTask("summarize medical questions:")\
.setMaxOutputLength(200)
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
t5
])
data = spark.createDataFrame([
[1, "content:SUBJECT: Normal physical traits but no period MESSAGE: I'm a 40 yr. old woman that has infantile reproductive organs and have never experienced a mensus. I have had Doctors look but they all say I just have infantile female reproductive organs. When I try to look for answers on the internet I cannot find anything. ALL my \"girly\" parts are normal. My organs never matured. Could you give me more information please. focus:all"]
]).toDF('id', 'text')
results = pipeline.fit(data).transform(data)
results.select("t5_output.result").show(truncate=False)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.t5.mediqa").predict("""content:SUBJECT: Normal physical traits but no period MESSAGE: I'm a 40 yr. old woman that has infantile reproductive organs and have never experienced a mensus. I have had Doctors look but they all say I just have infantile female reproductive organs. When I try to look for answers on the internet I cannot find anything. ALL my \""")
```
## Results
```bash
What are the treatments for mensus?, What are the treatments for infantile female reproductive organs?, What are the treatments for cancer?, What are the treatments for organ transplantation?, What are the treatments for cancer?, What are the treatments for cancer?
```
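As the output above shows, the generated questions can repeat (e.g. "What are the treatments for cancer?" appears three times). If that matters downstream, duplicates can be dropped while preserving order; a minimal post-processing sketch, separate from the model itself:

```python
def dedupe_preserving_order(items):
    """Drop repeated strings while keeping first-seen order."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

questions = [
    "What are the treatments for cancer?",
    "What are the treatments for organ transplantation?",
    "What are the treatments for cancer?",
]
print(dedupe_preserving_order(questions))  # keeps two unique questions
```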
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_base_mediqa_mnli|
|Compatibility:|Healthcare NLP 2.7.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
## Data Source
Trained on MEDIQA2021 and MedNLI Datasets
---
layout: model
title: Explain Document Pipeline for Swedish
author: John Snow Labs
name: explain_document_md
date: 2021-03-22
tags: [open_source, swedish, explain_document_md, pipeline, sv]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: sv
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The explain_document_md is a pretrained pipeline that processes text with a simple sequence of basic steps. It performs the most common text processing tasks (sentence detection, tokenization, lemmatization, part-of-speech tagging, word embeddings and named entity recognition) on your dataframe.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_sv_3.0.0_3.0_1616436435552.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_sv_3.0.0_3.0_1616436435552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('explain_document_md', lang = 'sv')
annotations = pipeline.fullAnnotate("Hej från John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_md", lang = "sv")
val result = pipeline.fullAnnotate("Hej från John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hej från John Snow Labs! "]
result_df = nlu.load('sv.explain.md').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | lemma | pos | embeddings | ner | entities |
|---:|:------------------------------|:-----------------------------|:-----------------------------------------|:-----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
| 0 | ['Hej från John Snow Labs! '] | ['Hej från John Snow Labs!'] | ['Hej', 'från', 'John', 'Snow', 'Labs!'] | ['Hej', 'från', 'John', 'Snow', 'Labs!'] | ['NOUN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.4006600081920624,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```
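The `entities` column above is produced by grouping consecutive `B-`/`I-` tags from the `ner` column into chunks. A minimal BIO-decoding sketch that reproduces this grouping (illustrative only, not Spark NLP's actual NerConverter implementation):

```python
def bio_to_chunks(tokens, tags):
    """Join token runs tagged B-X, I-X, I-X ... into entity strings."""
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # 'O' tag ends any open chunk
            if current:
                chunks.append(" ".join(current))
                current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ["Hej", "från", "John", "Snow", "Labs!"]
tags = ["O", "O", "B-PER", "I-PER", "I-PER"]
print(bio_to_chunks(tokens, tags))  # ['John Snow Labs!']
```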
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_document_md|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|sv|
---
layout: model
title: Fast Neural Machine Translation Model from Tok Pisin to English
author: John Snow Labs
name: opus_mt_tpi_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, tpi, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `tpi`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_tpi_en_xx_2.7.0_2.4_1609166984301.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_tpi_en_xx_2.7.0_2.4_1609166984301.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_tpi_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_tpi_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.tpi.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_tpi_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Sentence Entity Resolver for ICD-O (sbiobertresolve_icdo_augmented)
author: John Snow Labs
name: sbiobertresolve_icdo_augmented
date: 2021-06-22
tags: [licensed, en, clinical, entity_resolution]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.1.0
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to ICD-O codes using sBioBert sentence embeddings. This model is augmented using the site information coming from ICD10 and synonyms coming from SNOMED vocabularies. It is trained with a dataset that is 20x larger than the previous version of ICDO resolver.
Given an oncological entity found in the text (via NER models like ner_jsl), it returns top terms and resolutions along with the corresponding ICD-O codes to present more granularity with respect to the body parts mentioned. It also returns the original `Topography` codes, the `Morphology` codes comprising `Histology` and `Behavior` codes, and their descriptions.
## Predicted Entities
ICD-O Codes and their normalized definition with `sbiobert_base_cased_mli` embeddings.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icdo_augmented_en_3.1.0_3.0_1624344274944.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icdo_augmented_en_3.1.0_3.0_1624344274944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("jsl_ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "jsl_ner"]) \
.setOutputCol("ner_chunk")
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
icdo_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_icdo_augmented","en", "clinical/models") \
.setInputCols(["ner_chunk", "sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icdo_resolver])
data = spark.createDataFrame([["The patient is a very pleasant 61-year-old female with a strong family history of colon polyps. The patient reports her first polyps noted at the age of 50. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. She also has history of several malignancies in the family. Her father died of a brain tumor at the age of 81. Her sister died at the age of 65 breast cancer. She has two maternal aunts with history of lung cancer both of whom were smoker. Also a paternal grandmother who was diagnosed with leukemia at 86 and a paternal grandfather who had B-cell lymphoma."]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
...
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("jsl_ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "jsl_ner"))
.setOutputCol("ner_chunk")
val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
val icdo_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_icdo_augmented","en", "clinical/models")
.setInputCols(Array("ner_chunk", "sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icdo_resolver))
val data = Seq("The patient is a very pleasant 61-year-old female with a strong family history of colon polyps. The patient reports her first polyps noted at the age of 50. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. She also has history of several malignancies in the family. Her father died of a brain tumor at the age of 81. Her sister died at the age of 65 breast cancer. She has two maternal aunts with history of lung cancer both of whom were smoker. Also a paternal grandmother who was diagnosed with leukemia at 86 and a paternal grandfather who had B-cell lymphoma.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.icdo_augmented").predict("""The patient is a very pleasant 61-year-old female with a strong family history of colon polyps. The patient reports her first polyps noted at the age of 50. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. She also has history of several malignancies in the family. Her father died of a brain tumor at the age of 81. Her sister died at the age of 65 breast cancer. She has two maternal aunts with history of lung cancer both of whom were smoker. Also a paternal grandmother who was diagnosed with leukemia at 86 and a paternal grandfather who had B-cell lymphoma.""")
```
## Results
```bash
+--------------------+-----+---+-----------+-------------+-------------------------+-------------------------+
| chunk|begin|end| entity| code| all_k_resolutions| all_k_codes|
+--------------------+-----+---+-----------+-------------+-------------------------+-------------------------+
| mesothelioma| 255|266|Oncological|9971/3||C38.3|malignant mediastinal ...|9971/3||C38.3:::8854/3...|
|several malignancies| 293|312|Oncological|8894/3||C39.8|overlapping malignant ...|8894/3||C39.8:::8070/2...|
| brain tumor| 350|360|Oncological|9562/0||C71.9|cancer of the brain:::...|9562/0||C71.9:::9070/3...|
| breast cancer| 413|425|Oncological|9691/3||C50.9|carcinoma of breast:::...|9691/3||C50.9:::8070/2...|
| lung cancer| 471|481|Oncological|8814/3||C34.9|malignant tumour of lu...|8814/3||C34.9:::8550/3...|
| leukemia| 560|567|Oncological|9670/3||C80.9|anemia in neoplastic d...|9670/3||C80.9:::9714/3...|
| B-cell lymphoma| 610|624|Oncological|9818/3||C77.9|secondary malignant ne...|9818/3||C77.9:::9655/3...|
+--------------------+-----+---+-----------+-------------+-------------------------+-------------------------+
```
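The `code` column above concatenates the morphology code (histology `/` behavior) and the topography code with a `||` separator. A small parser for that format (assumed from the output layout above, not an official Spark NLP API):

```python
def parse_icdo_code(code):
    """Split e.g. '9971/3||C38.3' into histology, behavior and topography parts."""
    morphology, _, topography = code.partition("||")
    histology, _, behavior = morphology.partition("/")
    return {"histology": histology, "behavior": behavior, "topography": topography}

print(parse_icdo_code("9971/3||C38.3"))
# {'histology': '9971', 'behavior': '3', 'topography': 'C38.3'}
```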
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_icdo_augmented|
|Compatibility:|Healthcare NLP 3.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[icdo_code]|
|Language:|en|
|Case sensitive:|false|
## Data Source
Trained on the ICD-O Histology Behaviour dataset with `sbiobert_base_cased_mli` sentence embeddings: [https://apps.who.int/iris/bitstream/handle/10665/96612/9789241548496_eng.pdf](https://apps.who.int/iris/bitstream/handle/10665/96612/9789241548496_eng.pdf)
---
layout: model
title: German DistilBERT Embeddings
author: John Snow Labs
name: distilbert_embeddings_distilbert_base_german_cased
date: 2022-04-12
tags: [distilbert, embeddings, de, open_source]
task: Embeddings
language: de
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: DistilBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-german-cased` is a German model originally trained by HuggingFace.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_german_cased_de_3.4.2_3.0_1649783682880.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_german_cased_de_3.4.2_3.0_1649783682880.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_german_cased","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_german_cased","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ich liebe Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.embed.distilbert_base_german_cased").predict("""Ich liebe Spark NLP""")
```
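Token embeddings produced by models like this one are typically compared with cosine similarity. A minimal, dependency-free sketch of that comparison (the short vectors are made up for illustration; real DistilBERT vectors have 768 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 3))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 3))  # 0.0
```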
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_embeddings_distilbert_base_german_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|de|
|Size:|250.6 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/distilbert-base-german-cased
---
layout: model
title: Legal ORG, PRODUCT and ALIAS NER (small)
author: John Snow Labs
name: legner_orgs_prods_alias
date: 2022-08-17
tags: [en, legal, ner, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a Named Entity Recognition model, trained with a subset of generic CoNLL, financial and legal CoNLL, OntoNotes and several in-house corpora, to detect Organizations, Products and Aliases of Companies.
## Predicted Entities
`ORG`, `PROD`, `ALIAS`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_ORGPROD){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_orgs_prods_alias_en_1.0.0_3.2_1660733903868.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_orgs_prods_alias_en_1.0.0_3.2_1660733903868.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
import pyspark.sql.functions as F

documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias","en","legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties").
"""]
res = model.transform(spark.createDataFrame([text]).toDF("text"))
import pyspark.sql.functions as F
res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.metadata)).alias("cols")) \
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label"),
F.expr("cols['1']['confidence']").alias("confidence")).show(truncate=False)
```
## Results
```bash
+-----------------------------------+---------+----------+
|chunk |ner_label|confidence|
+-----------------------------------+---------+----------+
|Armstrong Flooring, Inc |ORG |0.807575 |
|Seller |ALIAS |0.997 |
|AFI Licensing LLC |ORG |0.7076333 |
|Licensing |ALIAS |0.9981 |
|Seller |ALIAS |0.996 |
|Arizona |ALIAS |0.9958 |
|AHF Holding, Inc. |ORG |0.72438 |
|Tarzan HoldCo, Inc |ORG |0.684675 |
|Buyer |ALIAS |0.9983 |
|Armstrong Hardwood Flooring Company|ORG |0.58274996|
|Company |ALIAS |0.9989 |
|Buyer |ALIAS |0.9979 |
|Buyer Entities |ALIAS |0.98835003|
|Arizona |ALIAS |0.9635 |
|Buyer Entities |ALIAS |0.77565 |
|Party |ALIAS |0.9982 |
+-----------------------------------+---------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_orgs_prods_alias|
|Type:|legal|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|16.7 MB|
## References
ConLL-2003, FinSec ConLL, a subset of Ontonotes, In-house corpora
## Benchmarking
```bash
label tp fp fn prec rec f1
I-ORG 12853 2621 2685 0.8306191 0.82719785 0.828905
B-PRODUCT 2306 697 932 0.76789874 0.712168 0.7389841
I-ALIAS 14 6 13 0.7 0.5185185 0.59574467
B-ORG 8967 2078 2311 0.81186056 0.79508775 0.80338657
I-PRODUCT 2336 803 1091 0.74418604 0.68164575 0.7115443
B-ALIAS 76 14 22 0.84444445 0.7755102 0.80851066
Macro-average 26552 6219 7054 0.78316814 0.7183547 0.7493626
Micro-average 26552 6219 7054 0.8102285 0.790097 0.80003613
```
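As a sanity check, the micro-average row can be reproduced directly from the pooled tp/fp/fn counts in the table above (a minimal sketch in plain Python; the counts are copied from the table):

```python
# Pooled counts from the micro-average row of the benchmark table.
tp, fp, fn = 26552, 6219, 7054

precision = tp / (tp + fp)  # correct predictions over all predicted entities
recall = tp / (tp + fn)     # correct predictions over all gold entities
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # matches 0.8102285, 0.790097, 0.80003613 up to float precision
```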
---
layout: model
title: English DistilBertForQuestionAnswering model (from Tianle)
author: John Snow Labs
name: distilbert_qa_Tianle_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Tianle`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Tianle_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724812794.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Tianle_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724812794.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Tianle_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Tianle_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Tianle").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_Tianle_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Tianle/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Sentence Entity Resolver for RxNorm (NDC)
author: John Snow Labs
name: sbiobertresolve_rxnorm_ndc
date: 2021-10-05
tags: [licensed, clinical, en, ndc, rxnorm]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.2.3
spark_version: 2.4
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps `DRUG` entities to RxNorm codes and their [National Drug Codes (NDC)](https://www.drugs.com/ndc.html#:~:text=The%20NDC%2C%20or%20National%20Drug,and%20the%20commercial%20package%20size.) using `sbiobert_base_cased_mli` sentence embeddings. You can find all NDC codes of the drugs, separated by the `|` symbol, in the `all_k_aux_labels` field of the metadata.
## Predicted Entities
`RxNorm Codes`, `NDC Codes`
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_ndc_en_3.2.3_2.4_1633424811842.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_ndc_en_3.2.3_2.4_1633424811842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
sbert_embedder = BertSentenceEmbeddings\
.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
.setInputCols(["ner_chunk"])\
.setOutputCol("sentence_embeddings")
rxnorm_ndc_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_rxnorm_ndc", "en", "clinical/models") \
.setInputCols(["ner_chunk", "sentence_embeddings"]) \
.setOutputCol("rxnorm_code")\
.setDistanceFunction("EUCLIDEAN")
rxnorm_ndc_pipeline = Pipeline(
stages = [
documentAssembler,
sbert_embedder,
rxnorm_ndc_resolver])
data = spark.createDataFrame([["activated charcoal 30000 mg powder for oral suspension"]]).toDF("text")
res = rxnorm_ndc_pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli", "en","clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("sentence_embeddings")
val rxnorm_ndc_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_rxnorm_ndc", "en", "clinical/models")
.setInputCols(Array("ner_chunk", "sentence_embeddings"))
.setOutputCol("rxnorm_code")
.setDistanceFunction("EUCLIDEAN")
val rxnorm_ndc_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, rxnorm_ndc_resolver))
val data = Seq("activated charcoal 30000 mg powder for oral suspension").toDF("text")
val res = rxnorm_ndc_pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.rxnorm_ndc").predict("""activated charcoal 30000 mg powder for oral suspension""")
```
## Results
```bash
+--+------------------------------------------------------+-----------+-----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------+
| |ner_chunk |rxnorm_code|all_codes |resolutions |all_k_aux_labels (ndc_codes) |
+--+------------------------------------------------------+-----------+-----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------+
|0 |activated charcoal 30000 mg powder for oral suspension|1440919 |[1440919, 808917, 1088194, 1191772, 808921,...]|'activated charcoal 30000 MG Powder for Oral Suspension', 'Activated Charcoal 30000 MG Powder for Oral Suspension', 'wheat dextrin 3000 MG Powder for Oral Solution [Benefiber]', 'cellulose 3000 MG Oral Powder [Unifiber]', 'fosfomycin 3000 MG Powder for Oral Solution [Monurol]', ...|69784030828, 00395052791, 08679001362|86790016280|00067004490, 46017004408|68220004416, 00456430001,...|
```
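The `all_k_aux_labels` column above packs the NDC codes for each RxNorm candidate; splitting on the `|` separator recovers the individual codes. A minimal sketch on one fragment copied from the output (the exact grouping of `,` versus `|` within the column is assumed here, not specified by the card):

```python
# One aux-label fragment from the output above; NDC codes are joined by "|".
aux_label = "08679001362|86790016280|00067004490"

ndc_codes = [code.strip() for code in aux_label.split("|")]
print(ndc_codes)  # ['08679001362', '86790016280', '00067004490']
```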
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_rxnorm_ndc|
|Compatibility:|Healthcare NLP 3.2.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[rxnorm_code]|
|Language:|en|
|Case sensitive:|false|
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from krinal214)
author: John Snow Labs
name: xlm_roberta_qa_xlm_all
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-all` is an English model originally trained by `krinal214`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_all_en_4.0.0_3.0_1655988363716.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_all_en_4.0.0_3.0_1655988363716.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_all","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlm_all","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.tydiqa.xlm_roberta").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_all|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|924.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/krinal214/xlm-all
---
layout: model
title: French CamemBert Embeddings (from tnagata)
author: John Snow Labs
name: camembert_embeddings_tnagata_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `tnagata`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_tnagata_generic_model_fr_3.4.4_3.0_1653990401775.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_tnagata_generic_model_fr_3.4.4_3.0_1653990401775.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_tnagata_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_tnagata_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_tnagata_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/tnagata/dummy-model
---
layout: model
title: Adverse Drug Events Classifier
author: John Snow Labs
name: classifierml_ade
date: 2023-05-04
tags: [ade, clinical, licensed, en, text_classification]
task: Text Classification
language: en
edition: Healthcare NLP 4.4.1
spark_version: 3.0
supported: true
annotator: DocumentMLClassifierModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained with the `DocumentMLClassifierApproach` annotator and classifies a text/sentence into two categories:
- `True`: the sentence talks about a possible ADE.
- `False`: the sentence doesn't contain any information about an ADE.
The corpus used for model training is the ADE-Corpus-V2 dataset (Adverse Drug Reaction Data), a dataset for classifying whether a sentence is ADE-related (`True`) or not (`False`).
## Predicted Entities
`True`, `False`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierml_ade_en_4.4.1_3.0_1683229229936.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierml_ade_en_4.4.1_3.0_1683229229936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
classifier_ml = DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models")\
.setInputCols("token")\
.setOutputCol("prediction")
clf_Pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
classifier_ml])
data = spark.createDataFrame([["""I feel great after taking tylenol."""], ["""Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient."""]]).toDF("text")
result = clf_Pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val classifier_ml = DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models")
.setInputCols("token")
.setOutputCol("prediction")
val clf_Pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, classifier_ml))
val data = Seq("I feel great after taking tylenol.", "Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.").toDS.toDF("text")
val result = clf_Pipeline.fit(data).transform(data)
```
## Results
```bash
+----------------------------------------------------------------------------------------+-------+
|text |result |
+----------------------------------------------------------------------------------------+-------+
|I feel great after taking tylenol |[False]|
|Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.|[True] |
+----------------------------------------------------------------------------------------+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|classifierml_ade|
|Compatibility:|Healthcare NLP 4.4.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[prediction]|
|Language:|en|
|Size:|2.7 MB|
## References
The corpus used for model training is ADE-Corpus-V2 Dataset: Adverse Drug Reaction Data. This is a dataset for classification of a sentence if it is ADE-related (True) or not (False).
Reference: Gurulingappa et al., Benchmark Corpus to Support Information Extraction for Adverse Drug Effects, JBI, 2012. http://www.sciencedirect.com/science/article/pii/S1532046412000615
## Benchmarking
```bash
label precision recall f1-score support
False 0.90 0.94 0.92 3359
True 0.85 0.75 0.79 1364
accuracy - - 0.89 4723
macro avg 0.87 0.85 0.86 4723
weighted avg 0.89 0.89 0.89 4723
```
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_kv32
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-kv32` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_kv32_en_4.3.0_3.0_1675121475776.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_kv32_en_4.3.0_3.0_1675121475776.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_small_kv32","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_small_kv32","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_kv32|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|129.8 MB|
## References
- https://huggingface.co/google/t5-efficient-small-kv32
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English BertForQuestionAnswering model (from andi611)
author: John Snow Labs
name: bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_movie_with_neg_with_repeat
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-whole-word-masking-squad2-with-ner-mit-movie-with-neg-with-repeat` is an English model originally trained by `andi611`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_movie_with_neg_with_repeat_en_4.0.0_3.0_1654537505373.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_movie_with_neg_with_repeat_en_4.0.0_3.0_1654537505373.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_movie_with_neg_with_repeat","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_movie_with_neg_with_repeat","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.movie_squadv2.bert.large_uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_movie_with_neg_with_repeat|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/andi611/bert-large-uncased-whole-word-masking-squad2-with-ner-mit-movie-with-neg-with-repeat
---
layout: model
title: English ElectraForQuestionAnswering model (from ptran74) Version-3
author: John Snow Labs
name: electra_qa_DSPFirst_Finetuning_3
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `DSPFirst-Finetuning-3` is an English model originally trained by `ptran74`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_DSPFirst_Finetuning_3_en_4.0.0_3.0_1655919564099.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_DSPFirst_Finetuning_3_en_4.0.0_3.0_1655919564099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_DSPFirst_Finetuning_3","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_DSPFirst_Finetuning_3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.electra.finetuning_3").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_DSPFirst_Finetuning_3|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ptran74/DSPFirst-Finetuning-3
- https://github.gatech.edu/pages/VIP-ITS/textbook_SQuAD_explore/explore/textbookv1.0/textbook/
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from thatdramebaazguy)
author: John Snow Labs
name: roberta_qa_movie_squad
date: 2022-12-02
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `movie-roberta-squad` is an English model originally trained by `thatdramebaazguy`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_movie_squad_en_4.2.4_3.0_1669985305755.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_movie_squad_en_4.2.4_3.0_1669985305755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_movie_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_movie_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_movie_squad|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|466.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/thatdramebaazguy/movie-roberta-squad
- https://github.com/ibm-aur-nlp/domain-specific-QA
- https://github.com/adityaarunsinghal/Domain-Adaptation/blob/master/scripts/shell_scripts/train_movieR_just_squadv1.sh
- https://github.com/adityaarunsinghal/Domain-Adaptation/
---
layout: model
title: English RobertaForQuestionAnswering (from jgammack)
author: John Snow Labs
name: roberta_qa_roberta_base_squad
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad` is an English model originally trained by `jgammack`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad_en_4.0.0_3.0_1655734821686.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad_en_4.0.0_3.0_1655734821686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base.by_jgammack").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/jgammack/roberta-base-squad
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_4 TFWav2Vec2ForCTC from nimrah
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_4
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_4` is an English model originally trained by nimrah.
NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_4_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_4_en_4.2.0_3.0_1664116941781.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_4_en_4.2.0_3.0_1664116941781.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_4', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_4", lang = "en")
val annotations = pipeline.transform(audioDF)
```
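The pipeline expects `audioDF` to carry raw audio as arrays of floats. A stdlib-only sketch of producing such floats from a 16-bit PCM mono WAV file (resampling to the model's expected rate, typically 16 kHz, is not shown):

```python
import io
import struct
import wave

def wav_to_floats(source):
    """Decode 16-bit PCM mono WAV samples into floats in [-1.0, 1.0]."""
    with wave.open(source, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# build a tiny in-memory WAV with three known samples to demonstrate
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<3h", 0, 16384, -32768))
buf.seek(0)
print(wav_to_floats(buf))  # [0.0, 0.5, -1.0]
```

The resulting list of floats is what one would place in the audio content column before calling the pipeline.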
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_4|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from lucasresck)
author: John Snow Labs
name: distilbert_qa_lucasresck_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `lucasresck`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_lucasresck_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772108616.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_lucasresck_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772108616.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lucasresck_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lucasresck_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_lucasresck_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/lucasresck/distilbert-base-uncased-finetuned-squad
---
layout: model
title: NER Pipeline for German
author: John Snow Labs
name: xlm_roberta_large_token_classifier_conll03_pipeline
date: 2022-04-19
tags: [german, roberta, xlm, ner, conll03, de, open_source]
task: Named Entity Recognition
language: de
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [xlm_roberta_large_token_classifier_conll03_de](https://nlp.johnsnowlabs.com/2021/12/25/xlm_roberta_large_token_classifier_conll03_de.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_conll03_pipeline_de_3.4.1_3.0_1650369924733.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_conll03_pipeline_de_3.4.1_3.0_1650369924733.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("xlm_roberta_large_token_classifier_conll03_pipeline", lang = "de")
pipeline.annotate("Ibser begann seine Karriere beim ASK Ebreichsdorf. 2004 wechselte er zu Admira Wacker Mödling, wo er auch in der Akademie spielte.")
```
```scala
val pipeline = new PretrainedPipeline("xlm_roberta_large_token_classifier_conll03_pipeline", lang = "de")
pipeline.annotate("Ibser begann seine Karriere beim ASK Ebreichsdorf. 2004 wechselte er zu Admira Wacker Mödling, wo er auch in der Akademie spielte.")
```
## Results
```bash
+----------------------+---------+
|chunk |ner_label|
+----------------------+---------+
|Ibser |PER |
|ASK Ebreichsdorf |ORG |
|Admira Wacker Mödling |ORG |
+----------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_large_token_classifier_conll03_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Language:|de|
|Size:|1.8 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- XlmRoBertaForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: Detect Assertion Status from Response to Treatment
author: John Snow Labs
name: assertion_oncology_response_to_treatment_wip
date: 2022-10-11
tags: [licensed, clinical, oncology, en, assertion]
task: Assertion Status
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: AssertionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model detects the assertion status of entities related to response to treatment. It labels positive mentions with Present_Or_Past status, and hypothetical or absent mentions with Hypothetical_Or_Absent status.
## Predicted Entities
`Hypothetical_Or_Absent`, `Present_Or_Past`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_response_to_treatment_wip_en_4.0.0_3.0_1665522412809.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_response_to_treatment_wip_en_4.0.0_3.0_1665522412809.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(["Response_To_Treatment"])
assertion = AssertionDLModel.pretrained("assertion_oncology_response_to_treatment_wip", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
assertion])
data = spark.createDataFrame([["The patient presented no evidence of recurrence."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("Response_To_Treatment"))
val assertion = AssertionDLModel.pretrained("assertion_oncology_response_to_treatment_wip","en","clinical/models")
.setInputCols(Array("sentence","ner_chunk","embeddings"))
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
assertion))
val data = Seq("""The patient presented no evidence of recurrence.""").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.assert.oncology_response_to_treatment_wip").predict("""The patient presented no evidence of recurrence.""")
```
## Results
```bash
| chunk | ner_label | assertion |
|:-----------|:----------------------|:-----------------------|
| recurrence | Response_To_Treatment | Hypothetical_Or_Absent |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|assertion_oncology_response_to_treatment_wip|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, chunk, embeddings]|
|Output Labels:|[assertion_pred]|
|Language:|en|
|Size:|1.4 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label precision recall f1-score support
Hypothetical_Or_Absent 0.82 0.90 0.86 61.0
Present_Or_Past 0.89 0.80 0.84 61.0
macro-avg 0.86 0.85 0.85 122.0
weighted-avg 0.86 0.85 0.85 122.0
```
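The reported f1 scores follow from the precision and recall columns, and because both labels have the same support (61), the macro and weighted averages coincide. A quick consistency check using the rounded figures from the table above:

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# per-label scores from the benchmarking table (rounded to two decimals)
assert abs(f1(0.82, 0.90) - 0.86) < 0.005  # Hypothetical_Or_Absent
assert abs(f1(0.89, 0.80) - 0.84) < 0.005  # Present_Or_Past

# equal support (61 each), so macro average equals weighted average
macro_f1 = (f1(0.82, 0.90) + f1(0.89, 0.80)) / 2
assert abs(macro_f1 - 0.85) < 0.005
```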
---
layout: model
title: English image_classifier_vit_hot_dog_or_sandwich ViTForImageClassification from osanseviero
author: John Snow Labs
name: image_classifier_vit_hot_dog_or_sandwich
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_hot_dog_or_sandwich` is an English model originally trained by osanseviero.
## Predicted Entities
`hot dog`, `sandwich`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_hot_dog_or_sandwich_en_4.1.0_3.0_1660169703998.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_hot_dog_or_sandwich_en_4.1.0_3.0_1660169703998.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_hot_dog_or_sandwich", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_hot_dog_or_sandwich", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_hot_dog_or_sandwich|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Spanish DistilBERT Embeddings (from Geotrend)
author: John Snow Labs
name: distilbert_embeddings_distilbert_base_es_cased
date: 2022-04-12
tags: [distilbert, embeddings, es, open_source]
task: Embeddings
language: es
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-es-cased` is a Spanish model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_es_cased_es_3.4.2_3.0_1649783277779.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_es_cased_es_3.4.2_3.0_1649783277779.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_es_cased","es") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_es_cased","es")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Me encanta chispa nlp").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.embed.distilbert_base_es_cased").predict("""Me encanta chispa nlp""")
```
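The `embeddings` column produced above holds one vector per token (768 dimensions for DistilBERT base models); such vectors are typically compared with cosine similarity. A dependency-free sketch of that comparison:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# parallel vectors have similarity ~1.0, orthogonal vectors ~0.0
print(cosine([1.0, 2.0], [2.0, 4.0]))
```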
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_embeddings_distilbert_base_es_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|es|
|Size:|237.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/distilbert-base-es-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Sentence Entity Resolver for ATC (sbiobert_base_cased_mli embeddings)
author: John Snow Labs
name: sbiobertresolve_atc
date: 2022-03-01
tags: [atc, licensed, en, clinical, entity_resolution]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 2.4
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps drug entities to ATC (Anatomical Therapeutic Chemical) codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings.
## Predicted Entities
`ATC Codes`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_atc_en_3.4.1_2.4_1646127233333.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_atc_en_3.4.1_2.4_1646127233333.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("word_embeddings")
posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \
.setInputCols(["sentence", "token", "word_embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(["DRUG"])
c2doc = Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sentence_embeddings")\
.setCaseSensitive(False)
atc_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_atc", "en", "clinical/models")\
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("atc_code")\
.setDistanceFunction("EUCLIDEAN")
resolver_pipeline = Pipeline(
stages = [
document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
posology_ner,
ner_converter,
c2doc,
sbert_embedder,
atc_resolver
])
sampleText = ["""He was seen by the endocrinology service and she was discharged on eltrombopag at night, amlodipine with meals metformin two times a day.""",
"""She was immediately given hydrogen peroxide 30 mg and amoxicillin twice daily for 10 days to treat the infection on her leg. She has a history of taking magnesium hydroxide.""",
"""She was given antidepressant for a month"""]
data = spark.createDataFrame(sampleText, StringType()).toDF("text")
results = resolver_pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("word_embeddings")
val posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("DRUG"))
val c2doc = Chunk2Doc()
.setInputCols(Array("ner_chunk"))
.setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sentence_embeddings")
.setCaseSensitive(false)
val atc_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_atc", "en", "clinical/models")
.setInputCols(Array("sentence_embeddings"))
.setOutputCol("atc_code")
.setDistanceFunction("EUCLIDEAN")
val resolver_pipeline = new Pipeline().setStages(Array(document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, posology_ner,
ner_converter, c2doc, sbert_embedder, atc_resolver))
val data = Seq("He was seen by the endocrinology service and she was discharged on eltrombopag at night, amlodipine with meals metformin two times a day and then ibuprofen. She was immediately given hydrogen peroxide 30 mg and amoxicillin twice daily for 10 days to treat the infection on her leg. She has a history of taking magnesium hydroxide. She was given antidepressant for a month").toDF("text")
val results = resolver_pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.atc").predict("""She was immediately given hydrogen peroxide 30 mg and amoxicillin twice daily for 10 days to treat the infection on her leg. She has a history of taking magnesium hydroxide.""")
```
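The resolver ranks candidate ATC codes by distance between sentence embeddings, as configured with `setDistanceFunction("EUCLIDEAN")` above. A toy sketch of that nearest-neighbor lookup, with made-up 2-d vectors standing in for the real 768-dimensional sbiobert embeddings (the codes shown are illustrative only):

```python
import math

def nearest_code(query, code_embeddings):
    """Return the code whose embedding has the smallest Euclidean distance to query."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(code_embeddings, key=lambda code: dist(query, code_embeddings[code]))

# toy 2-d embeddings; real resolver embeddings come from sbiobert_base_cased_mli
codes = {"N02BE01": [0.9, 0.1], "A02AA04": [0.1, 0.8]}
print(nearest_code([0.85, 0.2], codes))  # N02BE01
```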
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("jsl_ner_wip_greedy_clinical_pipeline", "en", "clinical/models")
text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature..'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("jsl_ner_wip_greedy_clinical_pipeline", "en", "clinical/models")
val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.wip_greedy_clinical.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature..""")
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:-----------------------------------------------|--------:|------:|:-----------------------------|-------------:|
| 0 | 21-day-old | 17 | 26 | Age | 0.9817 |
| 1 | Caucasian | 28 | 36 | Race_Ethnicity | 0.9998 |
| 2 | male | 38 | 41 | Gender | 0.9922 |
| 3 | for 2 days | 48 | 57 | Duration | 0.6968 |
| 4 | congestion | 62 | 71 | Symptom | 0.875 |
| 5 | mom | 75 | 77 | Gender | 0.8156 |
| 6 | suctioning yellow discharge | 88 | 114 | Symptom | 0.2697 |
| 7 | nares | 135 | 139 | External_body_part_or_region | 0.6216 |
| 8 | she | 147 | 149 | Gender | 0.9965 |
| 9 | mild problems with his breathing while feeding | 168 | 213 | Symptom | 0.444029 |
| 10 | perioral cyanosis | 237 | 253 | Symptom | 0.3283 |
| 11 | retractions | 258 | 268 | Symptom | 0.957 |
| 12 | One day ago | 272 | 282 | RelativeDate | 0.646267 |
| 13 | mom | 285 | 287 | Gender | 0.692 |
| 14 | tactile temperature | 304 | 322 | Symptom | 0.20765 |
| 15 | Tylenol | 345 | 351 | Drug | 0.9951 |
| 16 | Baby | 354 | 357 | Age | 0.981 |
| 17 | decreased p.o. intake | 377 | 397 | Symptom | 0.437375 |
| 18 | His | 400 | 402 | Gender | 0.999 |
| 19 | 20 minutes | 439 | 448 | Duration | 0.20415 |
| 20 | q.2h. | 450 | 454 | Frequency | 0.6406 |
| 21 | to 5 to 10 minutes | 456 | 473 | Duration | 0.12444 |
| 22 | his | 488 | 490 | Gender | 0.9904 |
| 23 | respiratory congestion | 492 | 513 | Symptom | 0.5294 |
| 24 | He | 516 | 517 | Gender | 0.9989 |
```
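The `begin` and `end` values in the table are inclusive character offsets into the input text, so for each chunk `end - begin + 1` equals the chunk length. A quick check with a few rows copied from the results above:

```python
# (chunk, begin, end) triples copied from the results table above
rows = [
    ("21-day-old", 17, 26),
    ("Caucasian", 28, 36),
    ("male", 38, 41),
    ("congestion", 62, 71),
]
for chunk, begin, end in rows:
    # begin/end are inclusive offsets, hence the +1
    assert end - begin + 1 == len(chunk)
print("offsets consistent")
```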
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|jsl_ner_wip_greedy_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk TFWav2Vec2ForCTC from krirk
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk` is an English model originally trained by krirk.
NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk_en_4.2.0_3.0_1664042731612.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk_en_4.2.0_3.0_1664042731612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk", lang = "en")
val annotations = pipeline.transform(audioDF)
```
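Both snippets assume an `audioDF` whose `audio_content` column already holds the raw audio as an array of floats. As a minimal sketch of how such floats could be produced (Python standard library only, no Spark; the commented DataFrame line and the file name are illustrative assumptions), a 16-bit mono WAV file can be decoded and normalized like this:

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit mono PCM WAV into a list of floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        assert wav.getnchannels() == 1, "expects mono audio"
        assert wav.getsampwidth() == 2, "expects 16-bit PCM"
        n = wav.getnframes()
        raw = wav.readframes(n)
    # '<h' = little-endian signed 16-bit; normalize by the int16 range
    samples = struct.unpack("<%dh" % n, raw)
    return [s / 32768.0 for s in samples]

# The floats could then feed a DataFrame with the column the pipeline expects:
# audioDf = spark.createDataFrame([[wav_to_floats("sample.wav")]]).toDF("audio_content")
```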
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English asr_wav2vec2_xls_r_300m_cv8 TFWav2Vec2ForCTC from comodoro
author: John Snow Labs
name: pipeline_asr_wav2vec2_xls_r_300m_cv8
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_cv8` is an English model originally trained by comodoro.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_cv8_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_cv8_en_4.2.0_3.0_1664014149392.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_cv8_en_4.2.0_3.0_1664014149392.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_cv8', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_cv8", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xls_r_300m_cv8|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Pipeline to Detect Genes/Proteins (BC2GM) in Medical Text
author: John Snow Labs
name: bert_token_classifier_ner_bc2gm_gene_pipeline
date: 2023-03-20
tags: [en, ner, clinical, licensed, bertfortokenclassification]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [bert_token_classifier_ner_bc2gm_gene](https://nlp.johnsnowlabs.com/2022/07/25/bert_token_classifier_ner_bc2gm_gene_en_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc2gm_gene_pipeline_en_4.3.0_3.2_1679303903870.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc2gm_gene_pipeline_en_4.3.0_3.2_1679303903870.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_ner_bc2gm_gene_pipeline", "en", "clinical/models")
text = '''ROCK-I, Kinectin, and mDia2 can bind the wild type forms of both RhoA and Cdc42 in a GTP-dependent manner in vitro. These results support the hypothesis that in the presence of tryptophan the ribosome translating tnaC blocks Rho ' s access to the boxA and rut sites, thereby preventing transcription termination.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_ner_bc2gm_gene_pipeline", "en", "clinical/models")
val text = "ROCK-I, Kinectin, and mDia2 can bind the wild type forms of both RhoA and Cdc42 in a GTP-dependent manner in vitro. These results support the hypothesis that in the presence of tryptophan the ribosome translating tnaC blocks Rho ' s access to the boxA and rut sites, thereby preventing transcription termination."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:------------|--------:|------:|:-------------|-------------:|
| 0 | ROCK-I | 0 | 5 | GENE/PROTEIN | 0.999978 |
| 1 | Kinectin | 8 | 15 | GENE/PROTEIN | 0.999973 |
| 2 | mDia2 | 22 | 26 | GENE/PROTEIN | 0.999974 |
| 3 | RhoA | 65 | 68 | GENE/PROTEIN | 0.999976 |
| 4 | Cdc42 | 74 | 78 | GENE/PROTEIN | 0.999979 |
| 5 | tnaC | 213 | 216 | GENE/PROTEIN | 0.999978 |
| 6 | Rho | 225 | 227 | GENE/PROTEIN | 0.999976 |
| 7 | boxA | 247 | 250 | GENE/PROTEIN | 0.999837 |
| 8 | rut sites | 256 | 264 | GENE/PROTEIN | 0.99115 |
```
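As the first rows show (`ROCK-I` spans 0 to 5, six characters), the `begin`/`end` columns are inclusive character offsets into the input text, so a chunk can be recovered with `text[begin:end + 1]`. A small illustrative helper (not part of the pipeline API):

```python
# Input text from the example above (first sentence)
text = ("ROCK-I, Kinectin, and mDia2 can bind the wild type forms of both "
        "RhoA and Cdc42 in a GTP-dependent manner in vitro.")

def chunk(text, begin, end):
    """Slice out an NER chunk using Spark NLP's inclusive begin/end offsets."""
    return text[begin:end + 1]

print(chunk(text, 0, 5))    # ROCK-I
print(chunk(text, 22, 26))  # mDia2
print(chunk(text, 74, 78))  # Cdc42
```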
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_bc2gm_gene_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.7 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverterInternalModel
---
layout: model
title: Recognize Entities DL Pipeline for Danish - Small
author: John Snow Labs
name: entity_recognizer_sm
date: 2021-03-22
tags: [open_source, danish, entity_recognizer_sm, pipeline, da]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: da
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The entity_recognizer_sm is a pretrained pipeline that processes text with a simple set of basic processing steps.
It performs most of the common text processing tasks on your dataframe.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_da_3.0.0_3.0_1616443414871.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_da_3.0.0_3.0_1616443414871.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('entity_recognizer_sm', lang = 'da')
annotations = pipeline.fullAnnotate("Hej fra John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("entity_recognizer_sm", lang = "da")
val result = pipeline.fullAnnotate("Hej fra John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hej fra John Snow Labs! "]
result_df = nlu.load('da.ner').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | embeddings | ner | entities |
|---:|:-----------------------------|:----------------------------|:----------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
| 0 | ['Hej fra John Snow Labs! '] | ['Hej fra John Snow Labs!'] | ['Hej', 'fra', 'John', 'Snow', 'Labs!'] | [[0.0306969992816448,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|entity_recognizer_sm|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|da|
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from sumedh)
author: John Snow Labs
name: t5_base_amazonreviews
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-amazonreviews` is an English model originally trained by `sumedh`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_amazonreviews_en_4.3.0_3.0_1675107991189.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_amazonreviews_en_4.3.0_3.0_1675107991189.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_base_amazonreviews","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_base_amazonreviews","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_base_amazonreviews|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|921.3 MB|
## References
- https://huggingface.co/sumedh/t5-base-amazonreviews
---
layout: model
title: Recognize Entities OntoNotes pipeline - BERT Mini
author: John Snow Labs
name: onto_recognize_entities_bert_mini
date: 2021-03-23
tags: [open_source, english, onto_recognize_entities_bert_mini, pipeline, en]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: en
nav_key: models
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The onto_recognize_entities_bert_mini is a pretrained pipeline that processes text with basic processing steps and recognizes entities.
It performs most of the common text processing tasks on your dataframe.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_mini_en_3.0.0_3.0_1616477436682.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_mini_en_3.0.0_3.0_1616477436682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('onto_recognize_entities_bert_mini', lang = 'en')
annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("onto_recognize_entities_bert_mini", lang = "en")
val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs ! "]
result_df = nlu.load('en.ner.onto.bert.mini').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | embeddings | ner | entities |
|---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:-------------------------------------------|:-------------------|
| 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[-0.147406503558158,.,...]] | ['O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O'] | ['John Snow Labs'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|onto_recognize_entities_bert_mini|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_2_h_512
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-2_H-512` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_512_zh_4.2.4_3.0_1670021628805.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_512_zh_4.2.4_3.0_1670021628805.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_512","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_512","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_2_h_512|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|66.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-2_H-512
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://github.com/dbiir/UER-py/
- https://cloud.tencent.com/
---
layout: model
title: Spanish BertForQuestionAnswering model (from IIC)
author: John Snow Labs
name: bert_qa_beto_base_spanish_sqac
date: 2022-06-02
tags: [es, open_source, question_answering, bert]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `beto-base-spanish-sqac` is a Spanish model originally trained by `IIC`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_beto_base_spanish_sqac_es_4.0.0_3.0_1654185522043.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_beto_base_spanish_sqac_es_4.0.0_3.0_1654185522043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_beto_base_spanish_sqac","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_beto_base_spanish_sqac","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.sqac.bert.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_beto_base_spanish_sqac|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|es|
|Size:|410.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/IIC/beto-base-spanish-sqac
- https://paperswithcode.com/sota?task=question-answering&dataset=PlanTL-GOB-ES%2FSQAC
- https://arxiv.org/abs/2107.07253
- https://github.com/dccuchile/beto
- https://www.bsc.es/
---
layout: model
title: Persian XlmRoBertaForQuestionAnswering (from SajjadAyoubi)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_large_fa_qa
date: 2022-06-23
tags: [fa, open_source, question_answering, xlmroberta]
task: Question Answering
language: fa
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-large-fa-qa` is a Persian model originally trained by `SajjadAyoubi`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_fa_qa_fa_4.0.0_3.0_1655995556403.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_fa_qa_fa_4.0.0_3.0_1655995556403.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_large_fa_qa","fa") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlm_roberta_large_fa_qa","fa")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("fa.answer_question.xlm_roberta.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_large_fa_qa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|fa|
|Size:|1.9 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/SajjadAyoubi/xlm-roberta-large-fa-qa
- https://colab.research.google.com/github/sajjjadayobi/PersianQA/blob/main/notebooks/HowToUse.ipynb
---
layout: model
title: English Deberta Embeddings model (from domenicrosati)
author: John Snow Labs
name: deberta_embeddings_xsmall_dapt_scientific_papers_pubmed
date: 2023-03-13
tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, en, tensorflow]
task: Embeddings
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DeBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deberta-xsmall-dapt-scientific-papers-pubmed` is an English model originally trained by `domenicrosati`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_xsmall_dapt_scientific_papers_pubmed_en_4.3.1_3.0_1678701718282.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_xsmall_dapt_scientific_papers_pubmed_en_4.3.1_3.0_1678701718282.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_xsmall_dapt_scientific_papers_pubmed","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_xsmall_dapt_scientific_papers_pubmed","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|deberta_embeddings_xsmall_dapt_scientific_papers_pubmed|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|246.9 MB|
|Case sensitive:|false|
## References
https://huggingface.co/domenicrosati/deberta-xsmall-dapt-scientific-papers-pubmed
---
layout: model
title: Sentence Detection in English Texts
author: John Snow Labs
name: sentence_detector_dl
date: 2021-01-02
task: Sentence Detection
language: en
nav_key: models
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [en, sentence_detection, open_source]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/SENTENCE_DETECTOR/){:.button.button-orange.button-orange-trans.arr.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/20.SentenceDetectorDL_Healthcare.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_en_2.7.0_2.4_1609611052663.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_en_2.7.0_2.4_1609611052663.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel\
.pretrained("sentence_detector_dl", "en") \
.setInputCols(["document"]) \
.setOutputCol("sentences")
sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL]))
sd_model.fullAnnotate("""John loves Mary.Mary loves Peter. Peter loves Helen .Helen loves John; Total: four people involved.""")
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val pipeline = new Pipeline().setStages(Array(documenter, model))
val data = Seq("John loves Mary.Mary loves Peter. Peter loves Helen .Helen loves John; Total: four people involved.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("sentence_detector").predict("""John loves Mary.Mary loves Peter. Peter loves Helen .Helen loves John; Total: four people involved.""")
```
## Results
```bash
+---+------------------------------+
| 0 | John loves Mary. |
+---+------------------------------+
| 1 | Mary loves Peter |
+---+------------------------------+
| 2 | Peter loves Helen . |
+---+------------------------------+
| 3 | Helen loves John; |
+---+------------------------------+
| 4 | Total: four people involved. |
+---+------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sentence_detector_dl|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[sentences]|
|Language:|en|
## Data Source
For more information, please visit the repository: https://github.com/dbmdz/deep-eos
## Benchmarking
```bash
label Accuracy Recall Prec F1
0 0.98 1.00 0.96 0.98
```
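The F1 column above is the harmonic mean of precision and recall; a quick plain-Python sanity check of the table's values:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values from the benchmarking table: Prec 0.96, Recall 1.00
print(round(f1_score(0.96, 1.00), 2))  # 0.98
```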
---
layout: model
title: Fast Neural Machine Translation Model from Afrikaans to Esperanto
author: John Snow Labs
name: opus_mt_af_eo
date: 2021-06-01
tags: [open_source, seq2seq, translation, af, eo, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
source languages: af
target languages: eo
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_af_eo_xx_3.1.0_2.4_1622559306014.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_af_eo_xx_3.1.0_2.4_1622559306014.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_af_eo", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
data = spark.createDataFrame([["text to translate"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_af_eo", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Afrikaans.translate_to.Esperanto').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_af_eo|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering Base Cased model (from kuberpmu)
author: John Snow Labs
name: distilbert_qa_kuberpmu_base_cased_led_squad_finetuned
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-finetuned-squad` is an English model originally trained by `kuberpmu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_kuberpmu_base_cased_led_squad_finetuned_en_4.3.0_3.0_1672766528239.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_kuberpmu_base_cased_led_squad_finetuned_en_4.3.0_3.0_1672766528239.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kuberpmu_base_cased_led_squad_finetuned","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kuberpmu_base_cased_led_squad_finetuned","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_kuberpmu_base_cased_led_squad_finetuned|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/kuberpmu/distilbert-base-cased-distilled-squad-finetuned-squad
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from ab20211112)
author: John Snow Labs
name: distilbert_qa_ab20211112_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `ab20211112`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_ab20211112_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769623851.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_ab20211112_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769623851.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ab20211112_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ab20211112_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_ab20211112_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/ab20211112/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English BertForQuestionAnswering Cased model (from motiondew)
author: John Snow Labs
name: bert_qa_sd1
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-sd1` is an English model originally trained by `motiondew`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sd1_en_4.0.0_3.0_1657187990798.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sd1_en_4.0.0_3.0_1657187990798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_sd1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/motiondew/bert-sd1
---
layout: model
title: Pipeline to Detect Clinical Entities (Slim version)
author: John Snow Labs
name: bert_token_classifier_ner_jsl_slim_pipeline
date: 2022-03-21
tags: [licensed, ner, slim, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [bert_token_classifier_ner_jsl_slim](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_jsl_slim_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_slim_pipeline_en_3.4.1_3.0_1647865346100.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_slim_pipeline_en_3.4.1_3.0_1647865346100.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("bert_token_classifier_ner_jsl_slim_pipeline", "en", "clinical/models")
pipeline.annotate("HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.")
```
```scala
val pipeline = new PretrainedPipeline("bert_token_classifier_ner_jsl_slim_pipeline", "en", "clinical/models")
pipeline.annotate("HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.jsl_slim.pipeline").predict("""HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.""")
```
## Results
```bash
+----------------+------------+
|chunk |ner_label |
+----------------+------------+
|HISTORY: |Header |
|30-year-old |Age |
|female |Demographics|
|mammography |Test |
|soft tissue lump|Symptom |
|shoulder |Body_Part |
|breast cancer |Oncological |
|her mother |Demographics|
|age 58 |Age |
|breast cancer |Oncological |
+----------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_jsl_slim_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.8 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverter
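The `NerConverter` stage at the end of this pipeline is what turns per-token IOB tags into the chunks shown in the results table above. A simplified, illustrative re-implementation of that merging logic (not the actual Spark NLP code):

```python
def iob_to_chunks(tokens, tags):
    """Merge (token, IOB tag) pairs into (chunk, label) pairs.

    B-X starts a chunk, I-X extends the current chunk, anything else closes it.
    """
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["soft", "tissue", "lump", "palpated", "by", "the", "patient"]
tags   = ["B-Symptom", "I-Symptom", "I-Symptom", "O", "O", "O", "O"]
chunks = iob_to_chunks(tokens, tags)  # [("soft tissue lump", "Symptom")]
```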
---
layout: model
title: Detect PHI for Deidentification purposes (Spanish, reduced entities, augmented data, Roberta)
author: John Snow Labs
name: ner_deid_generic_roberta_augmented
date: 2022-02-16
tags: [deid, es, licensed]
task: De-identification
language: es
edition: Healthcare NLP 3.3.4
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 8 entities (1 more than the `ner_deid_generic_roberta` model).
This NER model is trained on a combination of custom datasets, the Spanish CoNLL 2002 corpus, the MeddoProf dataset and several data augmentation mechanisms, and has been augmented with the MEDDOCAN Spanish de-identification corpus (unlike `ner_deid_generic_roberta`, which does not include it). It is a generalized version of `ner_deid_subentity_roberta_augmented`.
This is a RoBERTa-embeddings-based model. The `ner_deid_generic_augmented` model, which uses Sciwi 300d embeddings, is also available.
## Predicted Entities
`CONTACT`, `NAME`, `DATE`, `ID`, `LOCATION`, `PROFESSION`, `AGE`, `SEX`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_roberta_augmented_es_3.3.4_3.0_1645006281743.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_roberta_augmented_es_3.3.4_3.0_1645006281743.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_roberta_augmented", "es", "clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
roberta_embeddings,
clinical_ner])
text = ['''
Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.
''']
df = spark.createDataFrame([text]).toDF("text")
results = nlpPipeline.fit(df).transform(df)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_roberta_augmented", "es", "clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
roberta_embeddings,
clinical_ner))
val text = "Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos."
val df = Seq(text).toDF("text")
val results = pipeline.fit(df).transform(df)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.med_ner.deid.generic.roberta").predict("""
Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.
""")
```
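For de-identification, the detected chunks are typically replaced by their entity labels. A minimal, illustrative sketch of that masking step (the `mask_phi` helper below is hypothetical; in practice Spark NLP's de-identification annotators handle this):

```python
def mask_phi(text, entities):
    """Replace each (start, end, label) character span with <LABEL>.

    Spans are applied right to left so earlier offsets stay valid.
    """
    for start, end, label in sorted(entities, key=lambda e: e[0], reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text

text = "Antonio Miguel Martínez, un varón de 35 años de edad."
# character offsets of the NAME and AGE chunks (illustrative, hand-computed)
entities = [(0, 23, "NAME"), (37, 44, "AGE")]
masked = mask_phi(text, entities)  # "<NAME>, un varón de <AGE> de edad."
```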
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_causal_qa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_causal_qa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bert.by_manav").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
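In the NLU one-liner, question and context are passed as a single string joined by `|||`. A small sketch of that convention (assuming the separator is exactly three pipes):

```python
SEP = "|||"

def join_qa(question: str, context: str) -> str:
    """Pack a question and its context into one NLU-style string."""
    return f"{question}{SEP}{context}"

def split_qa(combined: str):
    """Unpack an NLU-style string back into (question, context)."""
    question, _, context = combined.partition(SEP)
    return question, context

combined = join_qa("What's my name?", "My name is Clara and I live in Berkeley.")
question, context = split_qa(combined)
```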
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_causal_qa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/manav/causal_qa
- https://github.com/kstats/CausalQG
---
layout: model
title: English image_classifier_vit_base_patch16_224_in21k_snacks ViTForImageClassification from matteopilotto
author: John Snow Labs
name: image_classifier_vit_base_patch16_224_in21k_snacks
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224_in21k_snacks` is an English model originally trained by matteopilotto.
## Predicted Entities
`salad`, `candy`, `muffin`, `banana`, `grape`, `popcorn`, `pretzel`, `pineapple`, `juice`, `orange`, `doughnut`, `carrot`, `waffle`, `cake`, `cookie`, `ice cream`, `watermelon`, `hot dog`, `apple`, `strawberry`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_snacks_en_4.1.0_3.0_1660167587853.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_snacks_en_4.1.0_3.0_1660167587853.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_base_patch16_224_in21k_snacks", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_base_patch16_224_in21k_snacks", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_base_patch16_224_in21k_snacks|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|322.0 MB|
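Internally, an image classifier like this one produces one logit per label, and the predicted class is the argmax over those logits. An illustrative sketch with made-up logits over a few of the labels listed above:

```python
import math

labels = ["salad", "candy", "muffin", "banana", "popcorn"]
logits = [0.2, 1.1, 3.7, 0.5, -0.3]  # hypothetical model outputs

# softmax for readable probabilities, argmax for the final prediction
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]
predicted = labels[max(range(len(logits)), key=logits.__getitem__)]
```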
---
layout: model
title: BioBERT Sentence Embeddings (Pubmed)
author: John Snow Labs
name: sent_biobert_pubmed_base_cased
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [embeddings, en, open_source]
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed for biomedical text mining tasks such as named entity recognition, relation extraction and question answering. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)".
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_biobert_pubmed_base_cased_en_2.6.0_2.4_1598348028762.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_biobert_pubmed_base_cased_en_2.6.0_2.4_1598348028762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]]).toDF("text"))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased", "en")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.biobert.pubmed_base_cased').predict(text, output_level='sentence')
embeddings_df
```
{:.h2_title}
## Results
```bash
en_embed_sentence_biobert_pubmed_base_cased_embeddings sentence
[0.209750697016716, 0.21535921096801758, -0.59... I hate cancer
[0.01466107927262783, -0.20778851211071014, -0... Antibiotics aren't painkiller
```
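Sentence embeddings like these are usually compared with cosine similarity. A minimal sketch using two short, truncated vectors in place of the full 768-dimensional outputs (the values are only the first few dimensions shown above, for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# truncated, illustrative embedding vectors
v1 = [0.2098, 0.2154, -0.59]
v2 = [0.0147, -0.2078, -0.31]

sim = cosine(v1, v2)
```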
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_biobert_pubmed_base_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|en|
|Dimension:|768|
|Case sensitive:|true|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert)
---
layout: model
title: German Named Entity Recognition
author: John Snow Labs
name: xlmroberta_ner_xlm_roberta_large_finetuned_conll03_german
date: 2022-05-17
tags: [xlm_roberta, ner, token_classification, de, open_source]
task: Named Entity Recognition
language: de
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-large-finetuned-conll03-german` is a German model originally trained by HuggingFace.
## Predicted Entities
`PER`, `ORG`, `MISC`, `LOC`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_large_finetuned_conll03_german_de_3.4.2_3.0_1652807937775.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_large_finetuned_conll03_german_de_3.4.2_3.0_1652807937775.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_large_finetuned_conll03_german","de") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_large_finetuned_conll03_german","de")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Ich liebe Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_xlm_roberta_large_finetuned_conll03_german|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|1.8 GB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/xlm-roberta-large-finetuned-conll03-german
---
layout: model
title: Telugu RobertaForMaskedLM Cased model (from neuralspace-reverie)
author: John Snow Labs
name: roberta_embeddings_indic_transformers
date: 2022-12-12
tags: [te, open_source, roberta_embeddings, robertaformaskedlm]
task: Embeddings
language: te
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-te-roberta` is a Telugu model originally trained by `neuralspace-reverie`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_te_4.2.4_3.0_1670858686031.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_te_4.2.4_3.0_1670858686031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers","te") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers","te")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_indic_transformers|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|te|
|Size:|314.5 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/neuralspace-reverie/indic-transformers-te-roberta
- https://oscar-corpus.com/
---
layout: model
title: Spanish Text Classification (from `hackathon-pln-es`)
author: John Snow Labs
name: roberta_jurisbert_clas_art_convencion_americana_dh
date: 2022-05-20
tags: [roberta, text_classification, es, open_source]
task: Text Classification
language: es
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: RoBertaForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `jurisbert-clas-art-convencion-americana-dh` is a Spanish model originally trained by `hackathon-pln-es`.
## Predicted Entities
`Artículo 63.1 Reparaciones`, `Artículo 15. Derecho de Reunión`, `Artículo 4. Derecho a la Vida`, `Artículo 1. Obligación de Respetar los Derechos`, `Artículo 5. Derecho a la Integridad Personal`, `Artículo 8. Garantías Judiciales`, `Artículo 19. Derechos del Niño`, `Artículo 17. Protección a la Familia`, `Artículo 2. Deber de Adoptar Disposiciones de Derecho Interno`, `Artículo 16. Libertad de Asociación`, `Artículo 25. Protección Judicial`, `Artículo 11. Protección de la Honra y de la Dignidad`, `Artículo 12. Libertad de Conciencia y de Religión`, `Artículo 9. Principio de Legalidad y de Retroactividad`, `Artículo 7. Derecho a la Libertad Personal`, `Artículo 24. Igualdad ante la Ley`, `Artículo 6. Prohibición de la Esclavitud y Servidumbre`, `Artículo 22. Derecho de Circulación y de Residencia`, `Artículo 28. Cláusula Federal`, `Artículo 21. Derecho a la Propiedad Privada`, `Artículo_29_Normas_de_Interpretación`, `Artículo 23. Derechos Políticos`, `Artículo 13. Libertad de Pensamiento y de Expresión`, `Artículo 26. Desarrollo Progresivo`, `Artículo 30. Alcance de las Restricciones`, `Artículo 14. Derecho de Rectificación o Respuesta`, `Artículo 3. Derecho al Reconocimiento de la Personalidad Jurídica`, `Artículo 27. Suspensión de Garantías`, `Artículo 20. Derecho a la Nacionalidad`, `Artículo 18. Derecho al Nombre`
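Each predicted class encodes the article of the American Convention it refers to, so the article number can be pulled out of the label string. A small sketch (the regex is an assumption based on the label formats listed above):

```python
import re

def article_number(label: str) -> str:
    """Extract the article number from a predicted class label such as
    'Artículo 63.1 Reparaciones' or 'Artículo_29_Normas_de_Interpretación'."""
    match = re.search(r"Art[ií]culo[_ ]+(\d+(?:\.\d+)?)", label)
    return match.group(1) if match else ""

num_a = article_number("Artículo 63.1 Reparaciones")            # "63.1"
num_b = article_number("Artículo_29_Normas_de_Interpretación")  # "29"
```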
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_jurisbert_clas_art_convencion_americana_dh_es_3.4.4_3.0_1653049484318.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_jurisbert_clas_art_convencion_americana_dh_es_3.4.4_3.0_1653049484318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = RoBertaForSequenceClassification.pretrained("roberta_jurisbert_clas_art_convencion_americana_dh","es") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Me encanta Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = RoBertaForSequenceClassification.pretrained("roberta_jurisbert_clas_art_convencion_americana_dh","es")
.setInputCols(Array("sentence", "token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Me encanta Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_jurisbert_clas_art_convencion_americana_dh|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|es|
|Size:|466.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
https://huggingface.co/hackathon-pln-es/jurisbert-clas-art-convencion-americana-dh
---
layout: model
title: English RobertaForQuestionAnswering (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739325907.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739325907.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.base_ruletriplet_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|460.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0
---
layout: model
title: Pipeline to Detect diseases in Text (large)
author: John Snow Labs
name: ner_diseases_large_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, disease, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_diseases_large](https://nlp.johnsnowlabs.com/2021/04/01/ner_diseases_large_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diseases_large_pipeline_en_3.4.1_3.0_1647872024826.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diseases_large_pipeline_en_3.4.1_3.0_1647872024826.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_diseases_large_pipeline", "en", "clinical/models")
pipeline.annotate("""Detection of various other intracellular signaling proteins is also described. Multiple autoimmune syndrome has been detected. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. She has Chikungunya virus disease story also. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""")
```
```scala
val pipeline = new PretrainedPipeline("ner_diseases_large_pipeline", "en", "clinical/models")
pipeline.annotate("Detection of various other intracellular signaling proteins is also described. Multiple autoimmune syndrome has been detected. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. She has Chikungunya virus disease story also. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.diseases_large.pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Multiple autoimmune syndrome has been detected. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. She has Chikungunya virus disease story also. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""")
```
## Results
```bash
+----------------------------+---------+
|chunk |ner_label|
+----------------------------+---------+
|Multiple autoimmune syndrome|Disease |
|T-cell leukemia |Disease |
|T-cell leukemia |Disease |
|Chikungunya virus disease |Disease |
+----------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_diseases_large_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: Drug Side Effect Classification Pipeline - Voice of the Patient
author: John Snow Labs
name: bert_sequence_classifier_vop_drug_side_effect_pipeline
date: 2023-06-14
tags: [clinical, licensed, en, classification, vop]
task: Text Classification
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline includes the Medical Bert for Sequence Classification model to classify health-related text in colloquial language according to the presence or absence of mentions of drug side effects. The pipeline is built on top of the [bert_sequence_classifier_vop_drug_side_effect](https://nlp.johnsnowlabs.com/2023/06/13/bert_sequence_classifier_vop_drug_side_effect_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_drug_side_effect_pipeline_en_4.4.3_3.2_1686704779005.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_drug_side_effect_pipeline_en_4.4.3_3.2_1686704779005.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_sequence_classifier_vop_drug_side_effect_pipeline", "en", "clinical/models")
pipeline.annotate("I felt kind of dizzy after taking that medication for a month.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_sequence_classifier_vop_drug_side_effect_pipeline", "en", "clinical/models")
val result = pipeline.annotate("I felt kind of dizzy after taking that medication for a month.")
```
## Results
```bash
| text | prediction |
|:---------------------------------------------------------------|:-------------|
| I felt kind of dizzy after taking that medication for a month. | Drug_AE |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_vop_drug_side_effect_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|406.4 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- MedicalBertForSequenceClassification
---
layout: model
title: Spanish RobertaForQuestionAnswering Base Cased model (from JonatanGk)
author: John Snow Labs
name: roberta_qa_jonatangk_base_bne_finetuned_s_c
date: 2023-01-20
tags: [es, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: es
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-finetuned-sqac` is a Spanish model originally trained by `JonatanGk`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_jonatangk_base_bne_finetuned_s_c_es_4.3.0_3.0_1674213010026.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_jonatangk_base_bne_finetuned_s_c_es_4.3.0_3.0_1674213010026.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_jonatangk_base_bne_finetuned_s_c","es")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_jonatangk_base_bne_finetuned_s_c","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_jonatangk_base_bne_finetuned_s_c|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|460.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/JonatanGk/roberta-base-bne-finetuned-sqac
---
layout: model
title: Legal Other remedies Clause Binary Classifier
author: John Snow Labs
name: legclf_other_remedies_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `other-remedies` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline would make the model see only sentences, not the whole text, so it is better to skip it unless you want to do binary classification at the sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into account that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `other-remedies`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_other_remedies_clause_en_1.0.0_3.2_1660122803750.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_other_remedies_clause_en_1.0.0_3.2_1660122803750.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
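A usage example is missing from this card. Below is a minimal sketch following the pattern of other Legal NLP document classifiers that take `sentence_embeddings` as input (see the `Input Labels` row); the sentence embeddings model name (`sent_bert_base_cased`) is an assumption, so substitute the embeddings this classifier was trained with.

```python
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence-level BERT embeddings feeding the classifier (model name is an assumption)
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_other_remedies_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```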
## Results
```bash
+----------------+
|          result|
+----------------+
|[other-remedies]|
|         [other]|
|         [other]|
|[other-remedies]|
+----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_other_remedies_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.93 0.95 0.94 98
other-remedies 0.86 0.82 0.84 38
accuracy - - 0.91 136
macro-avg 0.90 0.88 0.89 136
weighted-avg 0.91 0.91 0.91 136
```
---
layout: model
title: German NER for Laws (Bert, Base)
author: John Snow Labs
name: legner_bert_base_courts
date: 2022-10-02
tags: [de, legal, ner, laws, court, licensed]
task: Named Entity Recognition
language: de
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model can be used to detect legal entities in German text, predicting up to 19 different labels:
```
| tag | meaning
-----------------
| AN  | Anwalt (lawyer)
| EUN | Europäische Norm (European norm)
| GS  | Gesetz (law, statute)
| GRT | Gericht (court)
| INN | Institution (institution)
| LD  | Land (country, federal state)
| LDS | Landschaft (region)
| LIT | Literatur (legal literature)
| MRK | Marke (trademark)
| ORG | Organisation (organization)
| PER | Person (person)
| RR  | Richter (judge)
| RS  | Rechtsprechung (case law)
| ST  | Stadt (city)
| STR | Straße (street)
| UN  | Unternehmen (company)
| VO  | Verordnung (regulation, ordinance)
| VS  | Vorschrift (rule, provision)
| VT  | Vertrag (contract)
```
German Named Entity Recognition model, trained on the German Base Bert model and fine-tuned on the Court Decisions (2017-2018) dataset (see the `References` section). You can also find a lighter, non-transformer-based Deep Learning version (`legner_courts`) and a Bert Large version (`legner_bert_large_courts`) in our Models Hub.
## Predicted Entities
`STR`, `LIT`, `PER`, `EUN`, `VT`, `MRK`, `INN`, `UN`, `RS`, `ORG`, `GS`, `VS`, `LDS`, `GRT`, `VO`, `RR`, `LD`, `AN`, `ST`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_LEGAL_DE/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_bert_base_courts_de_1.0.0_3.0_1664708306072.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_bert_base_courts_de_1.0.0_3.0_1664708306072.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from johnsnowlabs import nlp, legal
import pandas as pd
from pyspark.sql import functions as F

document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
ner_model = legal.BertForTokenClassification.pretrained("legner_bert_base_courts", "de", "legal/models")\
.setInputCols(["document", "token"])\
.setOutputCol("ner")\
.setCaseSensitive(True)\
.setMaxSentenceLength(512)
ner_converter = nlp.NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
ner_model,
ner_converter
])
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
text_list = ["""Der Europäische Gerichtshof für Menschenrechte (EGMR) gibt dabei allerdings ebenso wenig wie das Bundesverfassungsgericht feste Fristen vor, sondern stellt auf die jeweiligen Umstände des Einzelfalls ab.""",
"""Formelle Rechtskraft ( § 705 ZPO ) trat mit Verkündung des Revisionsurteils am 15. Dezember 2016 ein (vgl. Zöller / Seibel ZPO 32. Aufl. § 705 Rn. 8) ."""]
df = spark.createDataFrame(pd.DataFrame({"text" : text_list}))
result = model.transform(df)
result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
.select(F.expr("cols['0']").alias("ner_chunk"),
F.expr("cols['1']['entity']").alias("label")).show(truncate = False)
```
## Results
```bash
+------------------------------------------+-----+
|ner_chunk |label|
+------------------------------------------+-----+
|Europäische Gerichtshof für Menschenrechte|GRT |
|EGMR |GRT |
|Bundesverfassungsgericht |GRT |
|§ 705 ZPO |GS |
|Zöller / Seibel ZPO 32. Aufl. § 705 Rn. 8 |LIT |
+------------------------------------------+-----+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_bert_base_courts|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|407.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
The dataset used to train this model is taken from Leitner, et.al (2019)
Leitner, E., Rehm, G., and Moreno-Schneider, J. (2019). Fine-grained Named Entity Recognition in Legal Documents. In Maribel Acosta, et al., editors, Semantic Systems. The Power of AI and Knowledge Graphs. Proceedings of the 15th International Conference (SEMANTiCS2019), number 11702 in Lecture Notes in Computer Science, pages 272–287, Karlsruhe, Germany, 9. Springer. 10/11 September 2019.
Source of the annotated text:
Court decisions from 2017 and 2018 were selected for the dataset, published online by the Federal Ministry of Justice and Consumer Protection. The documents originate from seven federal courts: Federal Labour Court (BAG), Federal Fiscal Court (BFH), Federal Court of Justice (BGH), Federal Patent Court (BPatG), Federal Social Court (BSG), Federal Constitutional Court (BVerfG) and Federal Administrative Court (BVerwG).
## Benchmarking
```bash
label precision recall f1-score support
AN 0.82 0.61 0.70 23
EUN 0.90 0.93 0.92 210
GRT 0.95 0.98 0.96 445
GS 0.97 0.98 0.98 2739
INN 0.87 0.88 0.88 321
LD 0.92 0.94 0.93 189
LDS 0.44 0.73 0.55 26
LIT 0.85 0.91 0.88 449
MRK 0.40 0.86 0.55 44
ORG 0.72 0.79 0.76 184
PER 0.71 0.91 0.80 260
RR 0.73 0.58 0.65 208
RS 0.95 0.97 0.96 1859
ST 0.81 0.94 0.87 120
STR 0.69 0.69 0.69 26
UN 0.73 0.84 0.78 158
VO 0.82 0.86 0.84 107
VS 0.48 0.81 0.60 86
VT 0.90 0.87 0.89 442
micro-avg 0.90 0.93 0.92 7896
macro-avg 0.77 0.85 0.80 7896
weighted-avg 0.91 0.93 0.92 7896
```
---
layout: model
title: Word2Vec Embeddings in Malay (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, ms, open_source]
task: Embeddings
language: ms
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ms_3.4.1_3.0_1647445079012.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ms_3.4.1_3.0_1647445079012.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ms") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Saya suka Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ms")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Saya suka Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ms.embed.w2v_cc_300d").predict("""Saya suka Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|ms|
|Size:|700.4 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Legal Documentation Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_documentation_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, documentation, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the `legclf_documentation_bert` model, a Bert Sentence Embeddings Document Classifier, determines whether the document belongs to the `Documentation` class or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`Documentation`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_documentation_bert_en_1.0.0_3.0_1678111855792.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_documentation_bert_en_1.0.0_3.0_1678111855792.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
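A usage example is missing from this card. Below is a minimal sketch following the pattern of other Legal NLP document classifiers that take `sentence_embeddings` as input (see the `Input Labels` row); the sentence embeddings model name (`sent_bert_base_cased`) is an assumption, so substitute the embeddings this classifier was trained with.

```python
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence-level BERT embeddings feeding the classifier (model name is an assumption)
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_documentation_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```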
## Results
```bash
+---------------+
|         result|
+---------------+
|[Documentation]|
|        [Other]|
|        [Other]|
|[Documentation]|
+---------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_documentation_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.6 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Documentation 0.81 0.83 0.82 83
Other 0.87 0.85 0.86 107
accuracy - - 0.84 190
macro-avg 0.84 0.84 0.84 190
weighted-avg 0.84 0.84 0.84 190
```
---
layout: model
title: English ElectraForQuestionAnswering model (from mbartolo)
author: John Snow Labs
name: electra_qa_large_synqa
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-large-synqa` is an English model originally trained by `mbartolo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_large_synqa_en_4.0.0_3.0_1655921148669.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_large_synqa_en_4.0.0_3.0_1655921148669.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_large_synqa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_large_synqa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.synqa.electra.large").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_large_synqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/mbartolo/electra-large-synqa
---
layout: model
title: Sentence Detection in Telugu Text
author: John Snow Labs
name: sentence_detector_dl
date: 2021-08-30
tags: [te, sentence_detection, open_source]
task: Sentence Detection
language: te
edition: Spark NLP 3.2.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_te_3.2.0_3.0_1630338728542.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_te_3.2.0_3.0_1630338728542.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel\
.pretrained("sentence_detector_dl", "te") \
.setInputCols(["document"]) \
.setOutputCol("sentences")
sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL]))
sd_model.fullAnnotate("""ఆంగ్ల పఠన పేరాల యొక్క గొప్ప మూలం కోసం చూస్తున్నారా? మీరు సరైన స్థలానికి వచ్చారు. ఇటీవలి అధ్యయనం ప్రకారం, నేటి యువతలో చదివే అలవాటు వేగంగా తగ్గుతోంది. వారు కొన్ని సెకన్ల కంటే ఎక్కువ ఇచ్చిన ఆంగ్ల పఠన పేరాపై దృష్టి పెట్టలేరు! అలాగే, చదవడం అనేది అన్ని పోటీ పరీక్షలలో అంతర్భాగం. కాబట్టి, మీరు మీ పఠన నైపుణ్యాలను ఎలా మెరుగుపరుచుకుంటారు? ఈ ప్రశ్నకు సమాధానం నిజానికి మరొక ప్రశ్న: పఠన నైపుణ్యాల ఉపయోగం ఏమిటి? చదవడం యొక్క ముఖ్య ఉద్దేశ్యం 'అర్థం చేసుకోవడం'.""")
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "te")
.setInputCols(Array("document"))
.setOutputCol("sentences")
val pipeline = new Pipeline().setStages(Array(documenter, model))
val data = Seq("ఆంగ్ల పఠన పేరాల యొక్క గొప్ప మూలం కోసం చూస్తున్నారా? మీరు సరైన స్థలానికి వచ్చారు. ఇటీవలి అధ్యయనం ప్రకారం, నేటి యువతలో చదివే అలవాటు వేగంగా తగ్గుతోంది. వారు కొన్ని సెకన్ల కంటే ఎక్కువ ఇచ్చిన ఆంగ్ల పఠన పేరాపై దృష్టి పెట్టలేరు! అలాగే, చదవడం అనేది అన్ని పోటీ పరీక్షలలో అంతర్భాగం. కాబట్టి, మీరు మీ పఠన నైపుణ్యాలను ఎలా మెరుగుపరుచుకుంటారు? ఈ ప్రశ్నకు సమాధానం నిజానికి మరొక ప్రశ్న: పఠన నైపుణ్యాల ఉపయోగం ఏమిటి? చదవడం యొక్క ముఖ్య ఉద్దేశ్యం 'అర్థం చేసుకోవడం'.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load('te.sentence_detector').predict("ఆంగ్ల పఠన పేరాల యొక్క గొప్ప మూలం కోసం చూస్తున్నారా? మీరు సరైన స్థలానికి వచ్చారు. ఇటీవలి అధ్యయనం ప్రకారం, నేటి యువతలో చదివే అలవాటు వేగంగా తగ్గుతోంది. వారు కొన్ని సెకన్ల కంటే ఎక్కువ ఇచ్చిన ఆంగ్ల పఠన పేరాపై దృష్టి పెట్టలేరు! అలాగే, చదవడం అనేది అన్ని పోటీ పరీక్షలలో అంతర్భాగం. కాబట్టి, మీరు మీ పఠన నైపుణ్యాలను ఎలా మెరుగుపరుచుకుంటారు? ఈ ప్రశ్నకు సమాధానం నిజానికి మరొక ప్రశ్న: పఠన నైపుణ్యాల ఉపయోగం ఏమిటి? చదవడం యొక్క ముఖ్య ఉద్దేశ్యం 'అర్థం చేసుకోవడం'.", output_level ='sentence')
```
## Results
```bash
+--------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------+
|[ఆంగ్ల పఠన పేరాల యొక్క గొప్ప మూలం కోసం చూస్తున్నారా?] |
|[మీరు సరైన స్థలానికి వచ్చారు.] |
|[ఇటీవలి అధ్యయనం ప్రకారం, నేటి యువతలో చదివే అలవాటు వేగంగా తగ్గుతోంది.] |
|[వారు కొన్ని సెకన్ల కంటే ఎక్కువ ఇచ్చిన ఆంగ్ల పఠన పేరాపై దృష్టి పెట్టలేరు!]|
|[అలాగే, చదవడం అనేది అన్ని పోటీ పరీక్షలలో అంతర్భాగం.] |
|[కాబట్టి, మీరు మీ పఠన నైపుణ్యాలను ఎలా మెరుగుపరుచుకుంటారు?] |
|[ఈ ప్రశ్నకు సమాధానం నిజానికి మరొక ప్రశ్న:] |
|[పఠన నైపుణ్యాల ఉపయోగం ఏమిటి?] |
|[చదవడం యొక్క ముఖ్య ఉద్దేశ్యం 'అర్థం చేసుకోవడం'.] |
+--------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sentence_detector_dl|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[sentences]|
|Language:|te|
## Benchmarking
```bash
Accuracy: 0.98
Recall: 1.00
Precision: 0.96
F1: 0.98
```
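The reported F1 follows directly from the precision and recall above, since F1 is their harmonic mean. A quick check, using the benchmark values:

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.96, 1.00
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.98, matching the reported F1
```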
---
layout: model
title: Danish Lemmatizer
author: John Snow Labs
name: lemma
date: 2020-07-29 23:34:00 +0800
task: Lemmatization
language: da
edition: Spark NLP 2.5.5
spark_version: 2.4
tags: [lemmatizer, da]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_da_2.5.5_2.4_1596054395311.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_da_2.5.5_2.4_1596054395311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
lemmatizer = LemmatizerModel.pretrained("lemma", "da") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("John Snow er bortset fra at være kongen i nord, en engelsk læge og en leder inden for udvikling af anæstesi og medicinsk hygiejne.")
```
```scala
...
val lemmatizer = LemmatizerModel.pretrained("lemma", "da")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer))
val data = Seq("John Snow er bortset fra at være kongen i nord, en engelsk læge og en leder inden for udvikling af anæstesi og medicinsk hygiejne.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""John Snow er bortset fra at være kongen i nord, en engelsk læge og en leder inden for udvikling af anæstesi og medicinsk hygiejne."""]
lemma_df = nlu.load('da.lemma').predict(text, output_level='document')
lemma_df.lemma.values[0]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=3, result='John', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=5, end=8, result='Snow', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=10, end=11, result='være', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=13, end=19, result='bortset', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=21, end=23, result='fra', metadata={'sentence': '0'}, embeddings=[]),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma|
|Type:|lemmatizer|
|Compatibility:|Spark NLP 2.5.5+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[lemma]|
|Language:|da|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: Stopwords Remover for Slovene language (319 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, sl, open_source]
task: Stop Words Removal
language: sl
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_sl_3.4.1_3.0_1646672307800.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_sl_3.4.1_3.0_1646672307800.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","sl") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Nisi boljši od mene"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","sl")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("Nisi boljši od mene").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("sl.stopwords").predict("""Nisi boljši od mene""")
```
## Results
```bash
+--------------+
|result |
+--------------+
|[Nisi, boljši]|
+--------------+
```
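Conceptually, `StopWordsCleaner` filters tokens against a lookup set, which is why only `Nisi` and `boljši` survive above. A pure-Python sketch of that behaviour (the stopword set below is a tiny hand-picked subset for illustration, not the model's full 319-entry list):

```python
# Tiny illustrative subset of Slovene stopwords (the real model ships 319 entries).
stopwords = {"od", "mene", "je", "in", "na"}

tokens = "Nisi boljši od mene".split()
clean_tokens = [t for t in tokens if t.lower() not in stopwords]
print(clean_tokens)  # ['Nisi', 'boljši'], matching the Results table above
```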
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|sl|
|Size:|2.1 KB|
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_nl8
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-nl8` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl8_en_4.3.0_3.0_1675123172323.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl8_en_4.3.0_3.0_1675123172323.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_small_nl8","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_small_nl8","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_nl8|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|176.0 MB|
## References
- https://huggingface.co/google/t5-efficient-small-nl8
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Embeddings Healthcare 100 dims
author: John Snow Labs
name: embeddings_healthcare_100d
class: WordEmbeddingsModel
language: en
nav_key: models
repository: clinical/models
date: 2020-05-29
task: Embeddings
edition: Healthcare NLP 2.5.0
spark_version: 2.4
tags: [clinical,embeddings,en]
supported: true
annotator: WordEmbeddingsModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_healthcare_100d_en_2.5.0_2.4_1590794626292.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_healthcare_100d_en_2.5.0_2.4_1590794626292.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
model = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("word_embeddings")
```
```scala
val model = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("word_embeddings")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.glove.healthcare_100d").predict("""Put your text here.""")
```
{:.h2_title}
## Results
Word2Vec feature vectors based on ``embeddings_healthcare_100d``.
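These 100-dimensional vectors are typically compared with cosine similarity, e.g. to find related clinical terms. A pure-Python sketch of the comparison (the short vectors below are hand-written stand-ins for illustration, not actual model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-d stand-ins for the model's 100-d vectors (illustrative only).
v_fever = [0.9, 0.1, 0.0, 0.2]
v_pyrexia = [0.85, 0.15, 0.05, 0.25]
v_metformin = [0.0, 0.9, 0.4, 0.1]

print(cosine_similarity(v_fever, v_pyrexia))    # high: related terms
print(cosine_similarity(v_fever, v_metformin))  # low: unrelated terms
```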
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Name:|embeddings_healthcare_100d|
|Type:|WordEmbeddingsModel|
|Compatibility:|Spark NLP 2.5.0+|
|License:|Licensed|
|Edition:|Official|
|Input labels:|[document, token]|
|Output labels:|[word_embeddings]|
|Language:|en|
|Dimension:|100|
{:.h2_title}
## Data Source
Trained on PubMed + ICD10 + UMLS + MIMIC III corpora
https://www.nlm.nih.gov/databases/download/pubmed_medline.html
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from horsbug98)
author: John Snow Labs
name: xlm_roberta_qa_Part_1_XLM_Model_E1
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Part_1_XLM_Model_E1` is an English model originally trained by `horsbug98`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_Part_1_XLM_Model_E1_en_4.0.0_3.0_1655983332305.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_Part_1_XLM_Model_E1_en_4.0.0_3.0_1655983332305.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_Part_1_XLM_Model_E1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_Part_1_XLM_Model_E1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.tydiqa.xlm_roberta.by_horsbug98").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_Part_1_XLM_Model_E1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|877.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/horsbug98/Part_1_XLM_Model_E1
---
layout: model
title: English Bert Embeddings Cased model (from nlpie)
author: John Snow Labs
name: bert_embeddings_distil_clinical
date: 2023-02-22
tags: [open_source, bert, bert_embeddings, bertformaskedlm, en, tensorflow]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distil-clinicalbert` is an English model originally trained by `nlpie`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_distil_clinical_en_4.3.0_3.0_1677088459443.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_distil_clinical_en_4.3.0_3.0_1677088459443.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_distil_clinical","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark-NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_distil_clinical","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark-NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_distil_clinical|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|247.1 MB|
|Case sensitive:|true|
## References
https://huggingface.co/nlpie/distil-clinicalbert
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_16_finetuned_squad_seed_0
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_0_en_4.3.0_3.0_1674214155549.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_0_en_4.3.0_3.0_1674214155549.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_0","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_16_finetuned_squad_seed_0|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|416.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-0
---
layout: model
title: Detect Drugs and Posology Entities (ner_posology_greedy)
author: John Snow Labs
name: ner_posology_greedy
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model detects drugs, dosage, form, frequency, duration, route, and drug strength in text. It differs from `ner_posology` in the sense that it chunks together drugs, dosage, form, strength, and route when they appear together, resulting in a bigger chunk. It is trained using `embeddings_clinical` so please use the same embeddings in the pipeline.
## Predicted Entities
`DRUG`, `STRENGTH`, `DURATION`, `FREQUENCY`, `FORM`, `DOSAGE`, `ROUTE`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_greedy_en_3.0.0_3.0_1617208415393.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_greedy_en_3.0.0_3.0_1617208415393.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day."]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter))
val data = Seq("""The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.posology.greedy").predict("""The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.""")
```
## Results
```bash
+----+----------------------------------+---------+-------+------------+
| | chunks | begin | end | entities |
|---:|---------------------------------:|--------:|------:|-----------:|
| 0 | 1 capsule of Advil 10 mg | 27 | 50 | DRUG |
| 1 | magnesium hydroxide 100mg/1ml PO | 67 | 98 | DRUG |
| 2 | for 5 days | 52 | 61 | DURATION |
| 3 | 40 units of insulin glargine | 168 | 195 | DRUG |
| 4 | at night | 197 | 204 | FREQUENCY |
| 5 | 12 units of insulin lispro | 207 | 232 | DRUG |
| 6 | with meals | 234 | 243 | FREQUENCY |
| 7 | metformin 1000 mg | 250 | 266 | DRUG |
| 8 | two times a day | 268 | 282 | FREQUENCY |
+----+----------------------------------+---------+-------+------------+
```
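The chunk table above can be produced by flattening the `ner_chunk` annotations (as returned, for example, by `LightPipeline.fullAnnotate`) into rows. A Spark-free sketch of that post-processing step, where the annotation dicts are hand-written stand-ins mirroring the annotation schema, not actual model output:

```python
# Each ner_chunk annotation carries the chunk text, character offsets,
# and the entity label in its metadata.
chunks = [
    {"result": "1 capsule of Advil 10 mg", "begin": 27, "end": 50,
     "metadata": {"entity": "DRUG"}},
    {"result": "for 5 days", "begin": 52, "end": 61,
     "metadata": {"entity": "DURATION"}},
]

# Flatten into (chunk, begin, end, entity) rows.
rows = [(c["result"], c["begin"], c["end"], c["metadata"]["entity"]) for c in chunks]
for chunk, begin, end, entity in rows:
    print(f"{chunk:<30} {begin:>5} {end:>5} {entity}")
```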
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_posology_greedy|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Data Source
Trained on augmented i2b2_med7 + FDA dataset with ``embeddings_clinical``, [https://www.i2b2.org/NLP/Medication](https://www.i2b2.org/NLP/Medication).
---
layout: model
title: Word2Vec Embeddings in Occitan (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, oc, open_source]
task: Embeddings
language: oc
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_oc_3.4.1_3.0_1647450923281.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_oc_3.4.1_3.0_1647450923281.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","oc") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","oc")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("oc.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|oc|
|Size:|459.6 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: English Named Entity Recognition (from abhishek)
author: John Snow Labs
name: bert_ner_autonlp_prodigy_10_3362554
date: 2022-05-09
tags: [bert, ner, token_classification, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `autonlp-prodigy-10-3362554` is an English model originally trained by `abhishek`.
## Predicted Entities
`LOCATION`, `PERSON`, `ORG`, `PRODUCT`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_autonlp_prodigy_10_3362554_en_3.4.2_3.0_1652097317068.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_autonlp_prodigy_10_3362554_en_3.4.2_3.0_1652097317068.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_autonlp_prodigy_10_3362554","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_autonlp_prodigy_10_3362554","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_autonlp_prodigy_10_3362554|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/abhishek/autonlp-prodigy-10-3362554
---
layout: model
title: Thai BertForQuestionAnswering model (from zhufy)
author: John Snow Labs
name: bert_qa_xquad_th_mbert_base
date: 2022-06-02
tags: [th, open_source, question_answering, bert]
task: Question Answering
language: th
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xquad-th-mbert-base` is a Thai model originally trained by `zhufy`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_xquad_th_mbert_base_th_4.0.0_3.0_1654192577829.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_xquad_th_mbert_base_th_4.0.0_3.0_1654192577829.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_xquad_th_mbert_base","th") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_xquad_th_mbert_base","th")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("th.answer_question.xquad.multi_lingual_bert.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_xquad_th_mbert_base|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|th|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/zhufy/xquad-th-mbert-base
- https://github.com/deepmind/xquad
---
layout: model
title: Sentiment Analysis of IMDB Reviews (sentimentdl_use_imdb)
author: John Snow Labs
name: sentimentdl_use_imdb
date: 2021-01-15
task: Sentiment Analysis
language: en
nav_key: models
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, en, sentiment]
supported: true
annotator: SentimentDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Classify IMDB reviews into negative and positive categories using the `Universal Sentence Encoder`.
## Predicted Entities
`neg`, `pos`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentimentdl_use_imdb_en_2.7.0_2.4_1610715247685.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentimentdl_use_imdb_en_2.7.0_2.4_1610715247685.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
use = UniversalSentenceEncoder.pretrained('tfhub_use', lang="en") \
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
classifier = SentimentDLModel.pretrained('sentimentdl_use_imdb')\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("sentiment")
nlp_pipeline = Pipeline(stages=[document_assembler,
use,
classifier
])
l_model = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = l_model.fullAnnotate('Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!')
```
{:.nlu-block}
```python
import nlu
nlu.load("en.sentiment.imdb.use.dl").predict("""Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!""")
```
## Results
```bash
| | document | sentiment |
|---:|---------------------------------------------------------------------------------------------------------:|--------------:|
| | Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the | |
| 0 | film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music | positive |
| | was rad! Horror and sword fight freaks,buy this movie now! | |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sentimentdl_use_imdb|
|Compatibility:|Spark NLP 2.7.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[sentiment]|
|Language:|en|
|Dependencies:|tfhub_use|
## Data Source
This model is trained on data from https://ai.stanford.edu/~amaas/data/sentiment/
## Benchmarking
```bash
precision recall f1-score support
neg 0.88 0.82 0.85 12500
pos 0.84 0.88 0.86 12500
accuracy 0.85 25000
macro avg 0.86 0.86 0.85 25000
weighted avg 0.86 0.85 0.85 25000
```
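As a quick sanity check on the table above, each per-class F1 is the harmonic mean of that class's precision and recall, and with equal support (12,500 reviews per class) the macro and weighted averages coincide. A minimal check in plain Python:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision/recall figures from the benchmarking table
f1_neg = f1(0.88, 0.82)
f1_pos = f1(0.84, 0.88)
macro_f1 = (f1_neg + f1_pos) / 2

print(round(f1_neg, 2), round(f1_pos, 2), round(macro_f1, 2))  # 0.85 0.86 0.85
```

The rounded values match the `neg`, `pos`, and macro-average rows reported above.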
---
layout: model
title: Pipeline to Extract Granular Anatomical Entities from Oncology Texts
author: John Snow Labs
name: ner_oncology_anatomy_granular_pipeline
date: 2023-03-08
tags: [licensed, clinical, en, oncology, ner, anatomy]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_oncology_anatomy_granular](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_anatomy_granular_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_granular_pipeline_en_4.3.0_3.2_1678286098380.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_granular_pipeline_en_4.3.0_3.2_1678286098380.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_oncology_anatomy_granular_pipeline", "en", "clinical/models")
text = '''The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_oncology_anatomy_granular_pipeline", "en", "clinical/models")
val text = "The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:-------------|--------:|------:|:------------|-------------:|
| 0 | left | 36 | 39 | Direction | 0.9981 |
| 1 | breast | 41 | 46 | Site_Breast | 0.9969 |
| 2 | lungs | 82 | 86 | Site_Lung | 0.9978 |
| 3 | liver | 99 | 103 | Site_Liver | 0.9999 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_anatomy_granular_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Legal Duration Clause Binary Classifier
author: John Snow Labs
name: legclf_duration_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `duration` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
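As an illustration of the first technique, a minimal paragraph splitter in plain Python might look like this (a generic sketch assuming paragraphs are separated by blank lines; this is not code taken from the tutorial):

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Split a document into paragraphs on one or more blank lines."""
    # Two or more consecutive newlines (possibly with whitespace
    # between them) mark a paragraph boundary.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = ("Clause 1. Term.\nThis Agreement lasts 2 years.\n\n"
       "Clause 2. Notices.\nAll notices shall be in writing.")
print(split_paragraphs(doc))  # two paragraphs, one per clause
```

Each resulting paragraph can then be classified as a separate row, keeping every input under the token limit.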
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `duration`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_duration_clause_en_1.0.0_3.2_1660123443846.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_duration_clause_en_1.0.0_3.2_1660123443846.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
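This card does not include a usage snippet. Following the pattern of the other classifier cards, a minimal sketch might look like the one below. The `sent_bert_base_cased` embeddings stage is an assumption for illustration (the model only declares a `sentence_embeddings` input), so substitute the embeddings this model was actually trained with if they differ:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BertSentenceEmbeddings, ClassifierDLModel

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_duration_clause", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])
data = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```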
## Results
```bash
+----------+
|    result|
+----------+
|[duration]|
|   [other]|
|   [other]|
|[duration]|
+----------+
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_duration_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
duration 0.97 0.95 0.96 37
other 0.98 0.99 0.98 86
accuracy - - 0.98 123
macro-avg 0.97 0.97 0.97 123
weighted-avg 0.98 0.98 0.98 123
```
---
layout: model
title: Ganda XLMRobertaForTokenClassification Base Cased model (from mbeukman)
author: John Snow Labs
name: xlmroberta_ner_base_finetuned_luganda
date: 2022-08-13
tags: [lg, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: lg
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-ner-luganda` is a Ganda model originally trained by `mbeukman`.
## Predicted Entities
`ORG`, `LOC`, `PER`, `DATE`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_luganda_lg_4.1.0_3.0_1660427316470.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_luganda_lg_4.1.0_3.0_1660427316470.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_luganda","lg") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_luganda","lg")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_finetuned_luganda|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|lg|
|Size:|776.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-ner-luganda
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://www.apache.org/licenses/LICENSE-2.0
- https://github.com/masakhane-io/masakhane-ner
- https://arxiv.org/pdf/2103.11811.pdf
---
layout: model
title: English RoBERTa Embeddings (SCOTUS dataset)
author: John Snow Labs
name: roberta_embeddings_fairlex_scotus_minilm
date: 2022-04-14
tags: [roberta, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `fairlex-scotus-minilm` is an English model originally trained by `coastalcph`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_fairlex_scotus_minilm_en_3.4.2_3.0_1649947447091.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_fairlex_scotus_minilm_en_3.4.2_3.0_1649947447091.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_fairlex_scotus_minilm","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_fairlex_scotus_minilm","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.fairlex_scotus_minilm").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_fairlex_scotus_minilm|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|114.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/coastalcph/fairlex-scotus-minilm
- https://coastalcph.github.io
- https://github.com/iliaschalkidis
- https://twitter.com/KiddoThe2B
---
layout: model
title: Portuguese BERT Embeddings (Large Cased)
author: John Snow Labs
name: bert_portuguese_large_cased
date: 2020-11-04
task: Embeddings
language: pt
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, pt]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a pretrained BERT model for the Portuguese language. The `BERT-Base` and `BERT-Large` Cased variants were trained on `BrWaC` (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps, using whole-word masking.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_portuguese_large_cased_pt_2.6.0_2.4_1604487922125.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_portuguese_large_cased_pt_2.6.0_2.4_1604487922125.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("bert_portuguese_large_cased", "pt") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['Eu amo PNL']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("bert_portuguese_large_cased", "pt")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("Eu amo PNL").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["Eu amo PNL"]
embeddings_df = nlu.load('pt.bert.cased.large').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token pt_bert_cased_large_embeddings
Eu [0.6893012523651123, 0.18436528742313385, 0.14...
amo [0.6536692976951599, 0.17582201957702637, -0.5...
PNL [-0.1397203803062439, 0.5698696374893188, -0.3...
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_portuguese_large_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|pt|
|Dimension:|1024|
|Case sensitive:|true|
{:.h2_title}
## Data Source
The model is imported from https://github.com/neuralmind-ai/portuguese-bert
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_10
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-512-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_10_en_4.0.0_3.0_1657185134431.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_10_en_4.0.0_3.0_1657185134431.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_10","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_10","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_10|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-512-finetuned-squad-seed-10
---
layout: model
title: English DistilBertForQuestionAnswering model (from anurag0077) Squad3
author: John Snow Labs
name: distilbert_qa_base_uncased_finetuned_squad3
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad3` is an English model originally trained by `anurag0077`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad3_en_4.0.0_3.0_1654726909717.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad3_en_4.0.0_3.0_1654726909717.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad3","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_anurag0077").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_finetuned_squad3|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anurag0077/distilbert-base-uncased-finetuned-squad3
---
layout: model
title: Legal Change in control Clause Binary Classifier
author: John Snow Labs
name: legclf_change_in_control_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `change-in-control` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `change-in-control`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_change_in_control_clause_en_1.0.0_3.2_1660123291976.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_change_in_control_clause_en_1.0.0_3.2_1660123291976.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
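This card omits a usage snippet. In the style of the other classifier cards, a minimal sketch could be the following. Note that `sent_bert_base_cased` is an assumed choice of sentence-embeddings stage (the model only declares a `sentence_embeddings` input), so swap in the embeddings this model was actually trained with if they differ:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BertSentenceEmbeddings, ClassifierDLModel

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_change_in_control_clause", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])
data = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```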
## Results
```bash
+-------------------+
|             result|
+-------------------+
|[change-in-control]|
|            [other]|
|            [other]|
|[change-in-control]|
+-------------------+
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_change_in_control_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house.
## Benchmarking
```bash
label precision recall f1-score support
change-in-control 0.89 0.96 0.92 25
other 0.99 0.96 0.97 76
accuracy - - 0.96 101
macro-avg 0.94 0.96 0.95 101
weighted-avg 0.96 0.96 0.96 101
```
---
layout: model
title: Vietnamese XlmRoBertaForQuestionAnswering (from bhavikardeshna)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_base_vietnamese
date: 2022-06-23
tags: [vn, open_source, question_answering, xlmroberta]
task: Question Answering
language: vn
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-vietnamese` is a Vietnamese model originally trained by `bhavikardeshna`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_vietnamese_vn_4.0.0_3.0_1655991784070.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_vietnamese_vn_4.0.0_3.0_1655991784070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_vietnamese","vn") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlm_roberta_base_vietnamese","vn")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("vn.answer_question.xlm_roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_base_vietnamese|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|vn|
|Size:|880.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/bhavikardeshna/xlm-roberta-base-vietnamese
---
layout: model
title: Detect Drug Chemicals (BertForTokenClassifier)
author: John Snow Labs
name: bert_token_classifier_ner_drugs
date: 2021-09-20
tags: [drug, ner, en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.2.0
spark_version: 2.4
supported: true
annotator: MedicalBertForTokenClassifier
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for drugs. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. It detects drug chemicals.
## Predicted Entities
`DrugChem`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_BERT_TOKEN_CLASSIFIER/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_drugs_en_3.2.0_2.4_1632141658042.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_drugs_en_3.2.0_2.4_1632141658042.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_drugs", "en", "clinical/models")\
.setInputCols("token", "sentence")\
.setOutputCol("ner")\
.setCaseSensitive(True)
ner_converter = NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])
model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))
test_sentence = """The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulness of vinorelbine monotherapy in patients with advanced or recurrent breast cancer after standard therapy, we evaluated the efficacy and safety of vinorelbine in patients previously treated with anthracyclines and taxanes."""
result = model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]})))
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_drugs", "en", "clinical/models")
.setInputCols(Array("token", "sentence"))
.setOutputCol("ner")
.setCaseSensitive(true)
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))
val data = Seq("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulness of vinorelbine monotherapy in patients with advanced or recurrent breast cancer after standard therapy, we evaluated the efficacy and safety of vinorelbine in patients previously treated with anthracyclines and taxanes.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.ner_drugs").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulness of vinorelbine monotherapy in patients with advanced or recurrent breast cancer after standard therapy, we evaluated the efficacy and safety of vinorelbine in patients previously treated with anthracyclines and taxanes.""")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|potassium |DrugChem |
|nucleotide |DrugChem |
|anthracyclines|DrugChem |
|taxanes |DrugChem |
|vinorelbine |DrugChem |
|vinorelbine |DrugChem |
|anthracyclines|DrugChem |
|taxanes |DrugChem |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_drugs|
|Compatibility:|Healthcare NLP 3.2.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Case sensitive:|true|
|Max sentence length:|128|
## Data Source
Trained on i2b2_med7 + FDA. https://www.i2b2.org/NLP/Medication
## Benchmarking
```bash
label precision recall f1-score support
B-DrugChem 0.99 0.99 0.99 97872
I-DrugChem 0.99 0.99 0.99 54909
O 1.00 1.00 1.00 1191109
accuracy - - 1.00 1343890
macro-avg 0.99 0.99 0.99 1343890
weighted-avg 1.00 1.00 1.00 1343890
```
---
layout: model
title: Translate Haitian Creole to English Pipeline
author: John Snow Labs
name: translate_ht_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, ht, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `ht`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ht_en_xx_2.7.0_2.4_1609687007155.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ht_en_xx_2.7.0_2.4_1609687007155.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_ht_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_ht_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.ht.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_ht_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForMaskedLM Base Cased model (from model-attribution-challenge)
author: John Snow Labs
name: roberta_embeddings_model_attribution_challenge_base
date: 2022-12-12
tags: [en, open_source, roberta_embeddings, robertaformaskedlm]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base` is an English model originally trained by `model-attribution-challenge`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_model_attribution_challenge_base_en_4.2.4_3.0_1670859033776.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_model_attribution_challenge_base_en_4.2.4_3.0_1670859033776.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_model_attribution_challenge_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_model_attribution_challenge_base","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_model_attribution_challenge_base|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|300.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/model-attribution-challenge/roberta-base
- https://arxiv.org/abs/1907.11692
- https://github.com/pytorch/fairseq/tree/master/examples/roberta
- https://yknzhu.wixsite.com/mbweb
- https://en.wikipedia.org/wiki/English_Wikipedia
- https://commoncrawl.org/2016/10/news-dataset-available/
- https://github.com/jcpeterson/openwebtext
- https://arxiv.org/abs/1806.02847
---
layout: model
title: Detect Drugs and posology entities including experimental drugs and cycles (ner_posology_experimental)
author: John Snow Labs
name: ner_posology_experimental
date: 2021-09-01
tags: [licensed, clinical, en, ner]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.1.3
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model detects drugs, experimental drugs, cyclelength, cyclecount, cycleday, dosage, form, frequency, duration, route, and drug strength in text. It is based on the core `ner_posology` model, supports additional entities such as drug cycles, and is enriched with more data from clinical trials.
## Predicted Entities
`Administration`, `Cyclenumber`, `Strength`, `Cycleday`, `Duration`, `Cyclecount`, `Route`, `Form`, `Frequency`, `Cyclelength`, `Drug`, `Dosage`
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_experimental_en_3.1.3_3.0_1630511369574.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_experimental_en_3.1.3_3.0_1630511369574.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_posology_experimental", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["Y-90 Humanized Anti-Tac: 10 mCi (if a bone marrow transplant was part of the patient's previous therapy) or 15 mCi of yttrium labeled anti-TAC; followed by calcium trisodium Inj (Ca DTPA)..\n\nCalcium-DTPA: Ca-DTPA will be administered intravenously on Days 1-3 to clear the radioactive agent from the body."]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_posology_experimental", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter))
val data = Seq("""Y-90 Humanized Anti-Tac: 10 mCi (if a bone marrow transplant was part of the patient's previous therapy) or 15 mCi of yttrium labeled anti-TAC; followed by calcium trisodium Inj (Ca DTPA)..\n\nCalcium-DTPA: Ca-DTPA will be administered intravenously on Days 1-3 to clear the radioactive agent from the body.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.posology.experimental").predict("""Y-90 Humanized Anti-Tac: 10 mCi (if a bone marrow transplant was part of the patient's previous therapy) or 15 mCi of yttrium labeled anti-TAC; followed by calcium trisodium Inj (Ca DTPA)..\n\nCalcium-DTPA: Ca-DTPA will be administered intravenously on Days 1-3 to clear the radioactive agent from the body.""")
```
## Results
```bash
| | chunk | begin | end | entity |
|---:|:-------------------------|--------:|------:|:---------|
| 0 | Anti-Tac | 15 | 22 | Drug |
| 1 | 10 mCi | 25 | 30 | Dosage |
| 2 | 15 mCi | 108 | 113 | Dosage |
| 3 | yttrium labeled anti-TAC | 118 | 141 | Drug |
| 4 | calcium trisodium Inj | 156 | 176 | Drug |
| 5 | Calcium-DTPA | 191 | 202 | Drug |
| 6 | Ca-DTPA | 205 | 211 | Drug |
| 7 | intravenously | 234 | 246 | Route |
| 8 | Days 1-3 | 251 | 258 | Cycleday |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_posology_experimental|
|Compatibility:|Healthcare NLP 3.1.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Data Source
This model is trained on the FDA 2018 Medication dataset, enriched with clinical trials data.
## Benchmarking
```bash
label tp fp fn prec rec f1
B-Drug 30260 1321 1630 0.95817107 0.9488868 0.95350635
B-Cycleday 294 1 7 0.99661016 0.9767442 0.9865772
B-Dosage 4019 441 972 0.9011211 0.8052494 0.85049194
I-Strength 21784 2375 1616 0.9016929 0.9309401 0.9160832
I-Cyclenumber 113 2 1 0.9826087 0.9912280 0.98689955
B-Cyclelength 217 3 0 0.98636365 1.0 0.99313504
B-Administration 97 1 5 0.9897959 0.95098037 0.96999997
I-Cyclecount 174 7 3 0.96132594 0.9830508 0.972067
B-Strength 18871 1299 1161 0.9355974 0.9420427 0.93880904
B-Frequency 13064 464 713 0.96570075 0.9482471 0.95689434
B-Cyclenumber 93 2 1 0.97894734 0.9893617 0.9841269
I-Duration 6116 519 738 0.92177844 0.89232564 0.9068129
B-Cyclecount 120 5 3 0.96 0.9756098 0.9677419
B-Form 10964 912 986 0.92320645 0.9174895 0.9203391
I-Route 275 42 51 0.8675079 0.8435583 0.85536546
I-Cyclelength 261 5 0 0.981203 1.0 0.9905123
I-Dosage 2385 471 1107 0.835084 0.6829897 0.75141776
I-Cycleday 548 5 13 0.9909584 0.9768271 0.983842
I-Frequency 18644 967 1574 0.9506909 0.9221486 0.9362023
I-Administration 303 10 5 0.9680511 0.98376626 0.9758454
I-Form 642 284 553 0.6933045 0.5372385 0.6053748
B-Route 5930 280 692 0.9549114 0.8954998 0.92425185
B-Duration 2422 261 359 0.9027208 0.87090975 0.88653
I-Drug 11472 1066 1240 0.9149784 0.9024544 0.9086733
Macro-average 149068 10743 13430 0.93426394 0.9111479 0.92256117
Micro-average 149068 10743 13430 0.93277687 0.91735286 0.9250006
```
---
layout: model
title: GloVe Embeddings 840B 300 (Multilingual)
author: John Snow Labs
name: glove_840B_300
date: 2020-01-22
task: Embeddings
language: xx
edition: Spark NLP 2.4.0
spark_version: 2.4
tags: [open_source, embeddings]
supported: true
annotator: WordEmbeddingsModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
GloVe (Global Vectors) is a model for distributed word representation. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity. It outperformed many common Word2vec models on the word analogy task. One benefit of GloVe is that it is the result of directly modeling relationships, instead of getting them as a side effect of training a language model.
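The "meaningful space" idea can be illustrated in a few lines of plain Python: related words should have a higher cosine similarity between their vectors than unrelated ones. The 3-d vectors below are made up purely for illustration and are not real GloVe values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors (not real GloVe values), chosen so the related pair is closer.
king, queen, banana = [0.8, 0.65, 0.1], [0.75, 0.7, 0.12], [0.1, 0.2, 0.9]
print(cosine(king, queen) > cosine(king, banana))  # → True
```

With the real 300-dimensional `glove_840B_300` embeddings the same comparison works, only over vectors produced by the pipeline below.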
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/glove_840B_300_xx_2.4.0_2.4_1579698926752.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/glove_840B_300_xx_2.4.0_2.4_1579698926752.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love Spark NLP']], ["text"]))
```
```scala
...
val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""I love Spark NLP"""]
glove_df = nlu.load('xx.embed.glove.840B_300').predict(text)
glove_df
```
{:.h2_title}
## Results
```bash
token | glove_embeddings |
-------|----------------------------------------------------|
I | [0.1941000074148178, 0.22603000700473785, -0.4...] |
love | [0.13948999345302582, 0.534529983997345, -0.25...] |
Spark | [0.20353999733924866, 0.6292600035667419, 0.27...] |
NLP | [0.059436000883579254, 0.18411000072956085, -0...] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|glove_840B_300|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.4.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[xx]|
|Dimension:|300|
|Case sensitive:|true|
{:.h2_title}
## Data Source
The model is imported from [https://nlp.stanford.edu/projects/glove/](https://nlp.stanford.edu/projects/glove/)
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_by_samantharhay TFWav2Vec2ForCTC from samantharhay
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab_by_samantharhay
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_samantharhay` is an English model originally trained by samantharhay.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_samantharhay_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_samantharhay_en_4.2.0_3.0_1664102943776.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_samantharhay_en_4.2.0_3.0_1664102943776.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_timit_demo_colab_by_samantharhay", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_timit_demo_colab_by_samantharhay", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab_by_samantharhay|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|354.8 MB|
---
layout: model
title: Spanish BertForQuestionAnswering model (from mrm8488)
author: John Snow Labs
name: bert_qa_distill_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488
date: 2022-06-03
tags: [es, open_source, question_answering, bert]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es` is a Spanish model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_distill_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488_es_4.0.0_3.0_1654249847218.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_distill_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488_es_4.0.0_3.0_1654249847218.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_distill_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_distill_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.squadv2.bert.distilled_base_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_distill_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|410.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es
- https://github.com/dccuchile/beto
- https://twitter.com/mrm8488
- https://github.com/ccasimiro88/TranslateAlignRetrieve
- https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/Huggingface_pipelines_demo.ipynb
- https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/Using_Spanish_BERT_fine_tuned_for_Q%26A_pipelines.ipynb
---
layout: model
title: Detect Clinical Conditions (ner_eu_clinical_case - fr)
author: John Snow Labs
name: ner_eu_clinical_condition
date: 2023-02-06
tags: [fr, clinical, licensed, ner, clinical_condition]
task: Named Entity Recognition
language: fr
edition: Healthcare NLP 4.2.8
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition (NER) deep learning model for extracting clinical conditions from French texts. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
## Predicted Entities
`clinical_condition`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_fr_4.2.8_3.0_1675725809666.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_fr_4.2.8_3.0_1675725809666.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fr")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained('ner_eu_clinical_condition', "fr", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["""Il aurait présenté il y’ a environ 30 ans des ulcérations génitales non traitées spontanément guéries. L’interrogatoire retrouvait une toux sèche depuis trois mois, des douleurs rétro-sternales constrictives, une dyspnée stade III de la NYHA et un contexte d’ apyrexie. Sur ce tableau s’ est greffé des œdèmes des membres inférieurs puis un tableau d’ anasarque d’ où son hospitalisation en cardiologie pour décompensation cardiaque globale."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fr")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_eu_clinical_condition", "fr", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter))
val data = Seq(Array("""Il aurait présenté il y’ a environ 30 ans des ulcérations génitales non traitées spontanément guéries. L’interrogatoire retrouvait une toux sèche depuis trois mois, des douleurs rétro-sternales constrictives, une dyspnée stade III de la NYHA et un contexte d’ apyrexie. Sur ce tableau s’ est greffé des œdèmes des membres inférieurs puis un tableau d’ anasarque d’ où son hospitalisation en cardiologie pour décompensation cardiaque globale.""")).toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+------------------------+------------------+
|chunk |ner_label |
+------------------------+------------------+
|ulcérations |clinical_condition|
|toux sèche |clinical_condition|
|douleurs |clinical_condition|
|dyspnée |clinical_condition|
|apyrexie |clinical_condition|
|anasarque |clinical_condition|
|décompensation cardiaque|clinical_condition|
+------------------------+------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_eu_clinical_condition|
|Compatibility:|Healthcare NLP 4.2.8+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|fr|
|Size:|899.9 KB|
## References
The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
## Benchmarking
```bash
label tp fp fn total precision recall f1
clinical_condition 269.0 51.0 52.0 321.0 0.8406 0.8380 0.8393
macro - - - - - - 0.8393
micro - - - - - - 0.8393
```
---
layout: model
title: Turkish BertForQuestionAnswering Cased model (from enelpi)
author: John Snow Labs
name: bert_qa_question_answering_cased_squadv2
date: 2022-07-07
tags: [tr, open_source, bert, question_answering]
task: Question Answering
language: tr
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-question-answering-cased-squadv2_tr` is a Turkish model originally trained by `enelpi`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_question_answering_cased_squadv2_tr_4.0.0_3.0_1657187898372.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_question_answering_cased_squadv2_tr_4.0.0_3.0_1657187898372.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_question_answering_cased_squadv2","tr") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_question_answering_cased_squadv2","tr")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_question_answering_cased_squadv2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|tr|
|Size:|413.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/enelpi/bert-question-answering-cased-squadv2_tr
---
layout: model
title: English BertForMaskedLM Base Cased model
author: John Snow Labs
name: bert_embeddings_base_cased
date: 2022-12-02
tags: [en, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased` is an English model originally trained by HuggingFace.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_cased_en_4.2.4_3.0_1670016286064.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_cased_en_4.2.4_3.0_1670016286064.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_cased","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_cased","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|406.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/bert-base-cased
- https://arxiv.org/abs/1810.04805
- https://github.com/google-research/bert
- https://yknzhu.wixsite.com/mbweb
- https://en.wikipedia.org/wiki/English_Wikipedia
---
layout: model
title: Smaller BERT Sentence Embeddings (L-8_H-256_A-4)
author: John Snow Labs
name: sent_small_bert_L8_256
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L8_256_en_2.6.0_2.4_1598350433990.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L8_256_en_2.6.0_2.4_1598350433990.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L8_256", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L8_256", "en")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer, "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.small_bert_L8_256').predict(text, output_level='sentence')
embeddings_df
```
{:.h2_title}
## Results
```bash
sentence en_embed_sentence_small_bert_L8_256_embeddings
I hate cancer [-0.04690948873758316, 0.5517814755439758, 0.7...
Antibiotics aren't painkiller [0.4066215753555298, 0.48149049282073975, 0.18...
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_small_bert_L8_256|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|[en]|
|Dimension:|256|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1
---
layout: model
title: Translate Hungarian to English Pipeline
author: John Snow Labs
name: translate_hu_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, hu, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `hu`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_hu_en_xx_2.7.0_2.4_1609688503781.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_hu_en_xx_2.7.0_2.4_1609688503781.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_hu_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_hu_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.hu.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_hu_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Stopwords Remover for Tamil language (125 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, ta, open_source]
task: Stop Words Removal
language: ta
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_ta_3.4.1_3.0_1646673010096.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_ta_3.4.1_3.0_1646673010096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","ta") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["நீங்கள் என்னை விட நன்றாக இல்லை"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","ta")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("நீங்கள் என்னை விட நன்றாக இல்லை").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ta.stopwords").predict("""நீங்கள் என்னை விட நன்றாக இல்லை""")
```
## Results
```bash
+-------------------------------+
|result |
+-------------------------------+
|[நீங்கள், என்னை, நன்றாக, இல்லை]|
+-------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|ta|
|Size:|2.0 KB|
---
layout: model
title: Fast Neural Machine Translation Model from Punjabi (Eastern) to English
author: John Snow Labs
name: opus_mt_pa_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pa, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `pa`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_pa_en_xx_2.7.0_2.4_1609164148817.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_pa_en_xx_2.7.0_2.4_1609164148817.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_pa_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_pa_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.pa.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_pa_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForQuestionAnswering Large Cased model (from dmis-lab)
author: John Snow Labs
name: bert_qa_biobert_large_cased_v1.1_squad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert-large-cased-v1.1-squad` is an English model originally trained by `dmis-lab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_large_cased_v1.1_squad_en_4.0.0_3.0_1657189073455.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_large_cased_v1.1_squad_en_4.0.0_3.0_1657189073455.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_large_cased_v1.1_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_large_cased_v1.1_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_biobert_large_cased_v1.1_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/dmis-lab/biobert-large-cased-v1.1-squad
---
layout: model
title: Legal Amendments and waivers Clause Binary Classifier (md)
author: John Snow Labs
name: legclf_amendments_and_waivers_md
date: 2023-01-11
tags: [en, legal, classification, document, agreement, contract, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `amendments-and-waivers` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `amendments-and-waivers`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_amendments_and_waivers_md_en_1.0.0_3.0_1673460275552.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_amendments_and_waivers_md_en_1.0.0_3.0_1673460275552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
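The usage snippet is missing from this card. The following is a minimal sketch of a typical Legal NLP document-classification pipeline; the `sent_bert_base_cased` embeddings model and the `ClassifierDLModel` annotator are assumptions based on similar `legclf_*` cards, not confirmed details of this model, and running it requires a licensed Spark NLP for Legal installation.

```python
# Hypothetical pipeline sketch: embeddings model name and classifier annotator
# are assumptions; adjust to the actual upstream components of this model.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = ClassifierDLModel.pretrained("legclf_amendments_and_waivers_md", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

# Feed whole paragraphs or split documents, not single sentences (see above).
data = spark.createDataFrame([["YOUR LEGAL CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("category.result").show(truncate=False)
```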
## Results
```bash
+------------------------+
|                  result|
+------------------------+
|[amendments-and-waivers]|
|                 [other]|
|                 [other]|
|[amendments-and-waivers]|
+------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_amendments_and_waivers_md|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house + SEC documents
## Benchmarking
```bash
precision recall f1-score support
amendments-and-waivers 1.00 0.82 0.90 38
other 0.85 1.00 0.92 39
accuracy 0.91 77
macro avg 0.92 0.91 0.91 77
weighted avg 0.92 0.91 0.91 77
```
---
layout: model
title: Spanish Named Entity Recognition (RoBERTa base trained with data from the National Library of Spain (BNE) and CoNLL 2003 data), by the TEMU Unit of the BSC-CNS
author: cayorodriguez
name: roberta_base_bne_conll_ner_spark_nlp
date: 2022-11-21
tags: [es, open_source]
task: Named Entity Recognition
language: es
edition: Spark NLP 4.0.0
spark_version: 3.2
supported: false
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. roberta-base-bne-conll-ner_spark_nlp is a Spanish model originally trained by TEMU-BSC for PlanTL-GOB-ES.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/community.johnsnowlabs.com/cayorodriguez/roberta_base_bne_conll_ner_spark_nlp_es_4.0.0_3.2_1669018824287.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://community.johnsnowlabs.com/cayorodriguez/roberta_base_bne_conll_ner_spark_nlp_es_4.0.0_3.2_1669018824287.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
ner = RoBertaForTokenClassification.pretrained("roberta_base_bne_conll_ner_spark_nlp","es") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, ner])
data = spark.createDataFrame([["El Plan Nacional para el Impulso de las Tecnologías del Lenguaje es una iniciativa del Gobierno de España"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
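The `ner` column produced above holds token-level IOB tags; in a full pipeline, Spark NLP's `NerConverter` is what merges them into entity chunks. As a plain-Python sketch of that merging logic (the helper name `merge_iob` and the example tags are illustrative, not part of the model's API):

```python
def merge_iob(tokens, tags):
    """Merge token-level IOB tags into (entity_text, label) chunks,
    mirroring what NerConverter does downstream. Orphan I- tags
    (without a matching B-) are dropped."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks
```

In the pipeline itself you would simply append a `NerConverter` stage reading `["sentence", "token", "ner"]` instead.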
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_base_bne_conll_ner_spark_nlp|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Community|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|447.3 MB|
|Case sensitive:|true|
|Max sentence length:|128|
---
layout: model
title: BERT Sequence Classifier - Classify the Music Genre
author: John Snow Labs
name: bert_sequence_classifier_song_lyrics
date: 2021-11-07
tags: [song, lyrics, en, bert_for_sequence_classification, open_source]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 3.3.2
spark_version: 2.4
supported: true
annotator: BertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is imported from Hugging Face models and classifies song lyrics into one of six music genres.
## Predicted Entities
`Dance`, `Heavy Metal`, `Hip Hop`, `Indie`, `Pop`, `Rock`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_song_lyrics_en_3.3.2_2.4_1636283685615.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_song_lyrics_en_3.3.2_2.4_1636283685615.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
sequenceClassifier = BertForSequenceClassification \
.pretrained('bert_sequence_classifier_song_lyrics', 'en') \
.setInputCols(['token', 'document']) \
.setOutputCol('class') \
.setCaseSensitive(True) \
.setMaxSentenceLength(512)
pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])
example = spark.createDataFrame([["""Because you need me Every single day Trying to find me But you don't know why Trying to find me again But you don't know how Trying to find me again Every single day"""]]).toDF("text")
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_song_lyrics", "en")
.setInputCols("document", "token")
.setOutputCol("class")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))
val example = Seq("""Because you need me Every single day Trying to find me But you don't know why Trying to find me again But you don't know how Trying to find me again Every single day""").toDS.toDF("text")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.song_lyrics").predict("""Because you need me Every single day Trying to find me But you don't know why Trying to find me again But you don't know how Trying to find me again Every single day""")
```
## Results
```bash
['Rock']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_song_lyrics|
|Compatibility:|Spark NLP 3.3.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[label]|
|Language:|en|
|Case sensitive:|true|
## Data Source
[https://huggingface.co/juliensimon/autonlp-song-lyrics-18753417](https://huggingface.co/juliensimon/autonlp-song-lyrics-18753417)
## Benchmarking
```bash
+--------------------+----------+
| Validation Metrics | Score |
+--------------------+----------+
| Loss | 0.906597 |
| Accuracy | 0.668027 |
| Macro F1 | 0.538484 |
| Micro F1 | 0.668027 |
| Weighted F1 | 0.64147 |
| Macro Precision | 0.67444 |
| Micro Precision | 0.668027 |
| Weighted Precision | 0.663409 |
| Macro Recall | 0.50784 |
| Micro Recall | 0.668027 |
| Weighted Recall | 0.668027 |
+--------------------+----------+
```
---
layout: model
title: RCT Binary Classifier (BioBERT Sentence Embeddings)
author: John Snow Labs
name: rct_binary_classifier_biobert
date: 2022-05-27
tags: [licensed, rct, clinical, classifier, en]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 3.4.2
spark_version: 3.0
supported: true
annotator: ClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a BioBERT based classifier that can classify if an article is a randomized clinical trial (RCT) or not.
## Predicted Entities
`true`, `false`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_RCT/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLASSIFICATION_RCT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rct_binary_classifier_biobert_en_3.4.2_3.0_1653668780966.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rct_binary_classifier_biobert_en_3.4.2_3.0_1653668780966.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
bert_sent = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased", "en") \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
classifier_dl = ClassifierDLModel.pretrained("rct_binary_classifier_biobert", "en", "clinical/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("class")
biobert_clf_pipeline = Pipeline(
stages = [
document_assembler,
bert_sent,
classifier_dl
])
data = spark.createDataFrame([["""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """]]).toDF("text")
result = biobert_clf_pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val bert_sent = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased", "en")
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val classifier_dl = ClassifierDLModel.pretrained("rct_binary_classifier_biobert", "en", "clinical/models")
.setInputCols(Array("sentence_embeddings"))
.setOutputCol("class")
val biobert_clf_pipeline = new Pipeline().setStages(Array(documenter, bert_sent, classifier_dl))
val data = Seq("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """).toDS.toDF("text")
val result = biobert_clf_pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.rct_binary_biobert").predict("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """)
```
## Results
```bash
| text | rct |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------|
| Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. | true |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|rct_binary_classifier_biobert|
|Compatibility:|Healthcare NLP 3.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
https://arxiv.org/abs/1710.06071
## Benchmarking
```bash
label precision recall f1-score support
false 0.86 0.81 0.84 2915
true 0.80 0.85 0.83 2545
accuracy - - 0.83 5460
macro-avg 0.83 0.83 0.83 5460
weighted-avg 0.83 0.83 0.83 5460
```
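The `weighted-avg` row above is the support-weighted mean of the per-class scores. A minimal sketch of that calculation (the helper name `weighted_avg` is illustrative):

```python
def weighted_avg(scores, supports):
    """Support-weighted average of per-class scores, as in the
    'weighted-avg' row of the benchmarking table."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

# Per-class F1 and support from the table above (rounded inputs,
# so the result only approximates the reported 0.83):
f1_weighted = weighted_avg([0.84, 0.83], [2915, 2545])
```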
---
layout: model
title: Fast Neural Machine Translation Model from Oromo to English
author: John Snow Labs
name: opus_mt_om_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, om, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `om`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_om_en_xx_2.7.0_2.4_1609169131097.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_om_en_xx_2.7.0_2.4_1609169131097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_om_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_om_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.om.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_om_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Bemba (Zambia) asr_wav2vec2_large_xls_r_300m_bemba_fds TFWav2Vec2ForCTC from csikasote
author: John Snow Labs
name: asr_wav2vec2_large_xls_r_300m_bemba_fds
date: 2022-09-24
tags: [wav2vec2, bem, audio, open_source, asr]
task: Automatic Speech Recognition
language: bem
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_bemba_fds` is a Bemba (Zambia) model originally trained by csikasote.
NOTE: This model works only on a CPU. If you need to use it on a GPU device, please use `asr_wav2vec2_large_xls_r_300m_bemba_fds_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_bemba_fds_bem_4.2.0_3.0_1664023896000.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_bemba_fds_bem_4.2.0_3.0_1664023896000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xls_r_300m_bemba_fds", "bem")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xls_r_300m_bemba_fds", "bem")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
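Both snippets above assume an existing `audioDf` whose `audio_content` column holds one float array per clip. A hedged sketch of producing such arrays from raw little-endian 16-bit PCM bytes (assuming 16 kHz mono input, which Wav2Vec2 models are typically trained on; the helper name is illustrative):

```python
import struct

def pcm16_to_floats(raw: bytes):
    """Decode little-endian 16-bit PCM samples to floats in [-1.0, 1.0),
    the per-clip representation AudioAssembler reads from its input column."""
    n = len(raw) // 2                      # two bytes per sample
    samples = struct.unpack("<%dh" % n, raw[: 2 * n])
    return [s / 32768.0 for s in samples]
```

A DataFrame for the pipeline can then be built with something like `spark.createDataFrame([(pcm16_to_floats(raw),)], ["audio_content"])`.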
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xls_r_300m_bemba_fds|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|bem|
|Size:|1.2 GB|
---
layout: model
title: Detect Clinical Entities (BertForTokenClassifier)
author: John Snow Labs
name: bert_token_classifier_ner_jsl
date: 2022-03-21
tags: [ner_jsl, ner, berfortokenclassification, en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.3.4
spark_version: 2.4
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for clinical terminology. This model was trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. It detects 77 entities.
Definitions of Predicted Entities:
- `Medical_Device`: All mentions related to medical devices and supplies.
- `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient.
- `Allergen`: Allergen related extractions mentioned in the document.
- `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients.
- `Clinical_Dept`: Terms that indicate the medical and/or surgical departments.
- `Symptom`: All the symptoms mentioned in the document, of a patient or someone else.
- `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by naked eye.
- `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient.
- `Age`: All mentions of ages, past or present, related to the patient or to anybody else.
- `Birth_Entity`: Mentions that indicate giving birth.
- `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else.
- `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs).
- `Test_Result`: Terms related to all the test results present in the document (clinical tests results are included).
- `Test`: Mentions of laboratory, pathology, and radiological tests.
- `Procedure`: All mentions of invasive medical or surgical procedures or treatments.
- `Treatment`: Includes therapeutic and minimally invasive treatment and procedures (invasive treatments or procedures are extracted as "Procedure").
- `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.).
## Predicted Entities
`Medical_Device`, `Physical_Measurement`, `Allergen`, `Procedure`, `Substance_Quantity`, `Drug`, `Test_Result`, `Pregnancy_Newborn`, `Admission_Discharge`, `Demographics`, `Lifestyle`, `Header`, `Date_Time`, `Treatment`, `Clinical_Dept`, `Test`, `Death_Entity`, `Age`, `Oncological`, `Body_Part`, `Birth_Entity`, `Vital_Sign`, `Symptom`, `Disease_Syndrome_Disorder`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_BERT_TOKEN_CLASSIFIER/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_en_3.3.4_2.4_1647895738040.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_en_3.3.4_2.4_1647895738040.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl", "en", "clinical/models")\
.setInputCols(["token", "sentence"])\
.setOutputCol("ner")\
.setCaseSensitive(True)
ner_converter = NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
tokenClassifier,
ner_converter])
sample_text = """The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."""
df = spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(df).transform(df)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl", "en", "clinical/models")
.setInputCols(Array("token", "sentence"))
.setOutputCol("ner")
.setCaseSensitive(true)
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
tokenClassifier,
ner_converter))
val sample_text = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text")
val result = pipeline.fit(sample_text).transform(sample_text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.ner_jsl").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_baseline", "ab")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_baseline", "ab")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_baseline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|ab|
|Size:|446.6 KB|
---
layout: model
title: Translate English to Malagasy Pipeline
author: John Snow Labs
name: translate_en_mg
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, mg, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `mg`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_mg_xx_2.7.0_2.4_1609687980137.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_mg_xx_2.7.0_2.4_1609687980137.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_mg", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_mg", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.mg').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_mg|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Pipeline to Detect Clinical Entities (ner_jsl_biobert)
author: John Snow Labs
name: ner_jsl_biobert_pipeline
date: 2023-03-20
tags: [clinical, licensed, en, ner]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_jsl_biobert](https://nlp.johnsnowlabs.com/2021/09/05/ner_jsl_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_biobert_pipeline_en_4.3.0_3.2_1679309924530.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_biobert_pipeline_en_4.3.0_3.2_1679309924530.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_jsl_biobert_pipeline", "en", "clinical/models")
text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_jsl_biobert_pipeline", "en", "clinical/models")
val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.jsl_biobert.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:------------------------------------------|--------:|------:|:-----------------------------|-------------:|
| 0 | 21-day-old | 17 | 26 | Age | 1 |
| 1 | Caucasian | 28 | 36 | Race_Ethnicity | 0.9304 |
| 2 | male | 38 | 41 | Gender | 1 |
| 3 | for 2 days | 48 | 57 | Duration | 0.6477 |
| 4 | congestion | 62 | 71 | Symptom | 0.7325 |
| 5 | mom | 75 | 77 | Gender | 0.9995 |
| 6 | suctioning | 88 | 97 | Modifier | 0.1445 |
| 7 | yellow discharge | 99 | 114 | Symptom | 0.43875 |
| 8 | nares | 135 | 139 | External_body_part_or_region | 0.9005 |
| 9 | she | 147 | 149 | Gender | 0.9956 |
| 10 | mild | 168 | 171 | Modifier | 0.5113 |
| 11 | problems with his breathing while feeding | 173 | 213 | Symptom | 0.4362 |
| 12 | perioral cyanosis | 237 | 253 | Symptom | 0.76325 |
| 13 | retractions | 258 | 268 | Symptom | 0.9819 |
| 14 | One day ago | 272 | 282 | RelativeDate | 0.838267 |
| 15 | mom | 285 | 287 | Gender | 0.9995 |
| 16 | tactile temperature | 304 | 322 | Symptom | 0.5194 |
| 17 | Tylenol | 345 | 351 | Drug_BrandName | 0.9999 |
| 18 | Baby | 354 | 357 | Age | 0.9997 |
| 19 | decreased p.o | 377 | 389 | Symptom | 0.445 |
| 20 | His | 400 | 402 | Gender | 0.9996 |
| 21 | from 20 minutes q.2h. to 5 to 10 minutes | 434 | 473 | Duration | 0.24581 |
| 22 | his | 488 | 490 | Gender | 0.9573 |
| 23 | respiratory congestion | 492 | 513 | Symptom | 0.5144 |
| 24 | He | 516 | 517 | Gender | 1 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_jsl_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.9 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Legal Non Competition Clause Binary Classifier
author: John Snow Labs
name: legclf_non_comp_clause
date: 2023-02-13
tags: [en, legal, classification, non_competition, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `non_comp` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
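As a concrete illustration of the splitting guidance above, here is a minimal pure-Python sketch (not part of Spark NLP) that splits a document into paragraphs by blank lines and re-chunks oversized paragraphs, using word count as a rough proxy for the 512-token limit. The `max_words` threshold is an illustrative assumption, not a value from this model card.

```python
import re

def split_paragraphs(text: str, max_words: int = 350) -> list:
    """Split a long document into paragraph-level chunks.

    Paragraphs are separated by blank lines (splitting "by multiline");
    paragraphs longer than max_words are further chunked so each piece
    stays safely within the model's 512-token embedding limit.
    """
    chunks = []
    for para in re.split(r"\n\s*\n", text.strip()):
        words = para.split()
        if not words:
            continue
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks

doc = "NON-COMPETITION. The Executive agrees...\n\nGOVERNING LAW. This Agreement..."
print(split_paragraphs(doc))
```

Each returned chunk can then be fed to the classifier as an independent document.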
## Predicted Entities
`non_comp`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_non_comp_clause_en_1.0.0_3.0_1676304359955.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_non_comp_clause_en_1.0.0_3.0_1676304359955.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
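This card omits the usage snippet, so the following is a minimal sketch of the standard Legal NLP document-classification pipeline. It assumes the `johnsnowlabs` library with a valid Legal NLP license and an active `spark` session; the card only confirms that the classifier consumes `sentence_embeddings`, so the choice of `sent_bert_base_cased` as the embedding model is an assumption.

```python
# Sketch only: requires the johnsnowlabs library, a Legal NLP license,
# and an active Spark session. sent_bert_base_cased is an assumed
# embedding model, not confirmed by this card.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_non_comp_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR LEGAL TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```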
## Results
```bash
+----------+
|result    |
+----------+
|[non_comp]|
|[other]   |
|[other]   |
|[non_comp]|
+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_non_comp_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
non_comp 1.00 1.00 1.00 15
other 1.00 1.00 1.00 7
accuracy - - 1.00 22
macro-avg 1.00 1.00 1.00 22
weighted-avg 1.00 1.00 1.00 22
```
---
layout: model
title: English DistilBertForQuestionAnswering model (from Nadhiya)
author: John Snow Labs
name: distilbert_qa_Nadhiya_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Nadhiya`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Nadhiya_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724293769.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Nadhiya_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724293769.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Nadhiya_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Nadhiya_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Nadhiya").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_Nadhiya_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Nadhiya/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Legal Noncompetition Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_noncompetition_agreement_bert
date: 2023-01-29
tags: [en, legal, classification, noncompetition, agreement, licensed, bert, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_noncompetition_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `noncompetition-agreement` or not (Binary Classification).
Unlike the Longformer-based model, this model is lighter and faster at inference time.
## Predicted Entities
`noncompetition-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_noncompetition_agreement_bert_en_1.0.0_3.0_1674990641933.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_noncompetition_agreement_bert_en_1.0.0_3.0_1674990641933.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
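This card omits the usage snippet, so the following is a minimal sketch of the standard Legal NLP document-classification pipeline with Bert sentence embeddings. It assumes the `johnsnowlabs` library with a valid Legal NLP license and an active `spark` session; `sent_bert_base_cased` is an assumed embedding model (the card only confirms the classifier consumes `sentence_embeddings`).

```python
# Sketch only: requires the johnsnowlabs library, a Legal NLP license,
# and an active Spark session. sent_bert_base_cased is an assumed
# embedding model, not confirmed by this card.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_noncompetition_agreement_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR LEGAL TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```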
## Results
```bash
+--------------------------+
|result                    |
+--------------------------+
|[noncompetition-agreement]|
|[other]                   |
|[other]                   |
|[noncompetition-agreement]|
+--------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_noncompetition_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|
## References
Legal documents, scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
noncompetition-agreement 0.97 0.97 0.97 32
other 0.98 0.98 0.98 55
accuracy - - 0.98 87
macro-avg 0.98 0.98 0.98 87
weighted-avg 0.98 0.98 0.98 87
```
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_english_colab TFWav2Vec2ForCTC from shacharm
author: John Snow Labs
name: asr_wav2vec2_large_xls_r_300m_english_colab
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_english_colab` is an English model originally trained by shacharm.
NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_large_xls_r_300m_english_colab_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_english_colab_en_4.2.0_3.0_1664103475912.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_english_colab_en_4.2.0_3.0_1664103475912.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xls_r_300m_english_colab", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xls_r_300m_english_colab", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xls_r_300m_english_colab|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Italian Embeddings (Base, Recipes)
author: John Snow Labs
name: bert_embeddings_chefberto_italian_cased
date: 2022-04-11
tags: [bert, embeddings, it, open_source]
task: Embeddings
language: it
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chefberto-italian-cased` is an Italian model originally trained by `vinhood`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chefberto_italian_cased_it_3.4.2_3.0_1649676831699.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chefberto_italian_cased_it_3.4.2_3.0_1649676831699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_chefberto_italian_cased","it") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_chefberto_italian_cased","it")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Adoro Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("it.embed.chefberto_italian_cased").predict("""Adoro Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chefberto_italian_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|it|
|Size:|415.5 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/vinhood/chefberto-italian-cased
- https://twitter.com/denocris
- https://www.linkedin.com/in/cristiano-de-nobili/
- https://www.vinhood.com/en/
---
layout: model
title: English BertForMaskedLM Base Cased model (from VMware)
author: John Snow Labs
name: bert_embeddings_v_2021_base
date: 2022-12-02
tags: [en, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `vbert-2021-base` is an English model originally trained by `VMware`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_v_2021_base_en_4.2.4_3.0_1670022938608.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_v_2021_base_en_4.2.4_3.0_1670022938608.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_v_2021_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_v_2021_base","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_v_2021_base|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|409.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/VMware/vbert-2021-base
- https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99
---
layout: model
title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18)
author: John Snow Labs
name: roberta_qa_base_spanish_squades_becasincentivos3
date: 2023-01-20
tags: [es, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: es
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades-becasIncentivos3` is a Spanish model originally trained by `Evelyn18`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasincentivos3_es_4.3.0_3.0_1674218087235.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasincentivos3_es_4.3.0_3.0_1674218087235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasincentivos3","es")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasincentivos3","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_spanish_squades_becasincentivos3|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|459.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Evelyn18/roberta-base-spanish-squades-becasIncentivos3
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_roberta_base_squad2_24465524
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-roberta-base-squad2-24465524` is an English model originally trained by `teacookies`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465524_en_4.0.0_3.0_1655987091352.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465524_en_4.0.0_3.0_1655987091352.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465524","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465524","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.xlm_roberta.base_24465524.by_teacookies").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_autonlp_roberta_base_squad2_24465524|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|887.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/teacookies/autonlp-roberta-base-squad2-24465524
---
layout: model
title: Vietnamese Bert Embeddings
author: John Snow Labs
name: bert_embeddings_bert_base_vi_cased
date: 2022-04-11
tags: [bert, embeddings, vi, open_source]
task: Embeddings
language: vi
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-vi-cased` is a Vietnamese model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_vi_cased_vi_3.4.2_3.0_1649676357396.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_vi_cased_vi_3.4.2_3.0_1649676357396.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_vi_cased","vi") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Tôi yêu Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_vi_cased","vi")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Tôi yêu Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("vi.embed.bert_cased").predict("""Tôi yêu Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_vi_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|vi|
|Size:|373.7 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/bert-base-vi-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Normalizing Section Headers in Clinical Notes
author: John Snow Labs
name: normalized_section_header_mapper
date: 2022-04-04
tags: [en, chunkmapper, chunkmapping, normalizer, sectionheader, licensed, clinical]
task: Chunk Mapping
language: en
nav_key: models
edition: Healthcare NLP 3.4.2
spark_version: 3.0
supported: true
annotator: NotDefined
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained model normalizes the section headers in clinical notes. It returns two levels of normalization, called `level_1` (coarse) and `level_2` (finer).
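To make the two normalization levels concrete, here is a hypothetical pure-Python illustration of what the mapper does conceptually. The header names and mapped values below are made-up examples for illustration only, not the model's actual dictionary, and `normalize_header` merely mimics the behavior selected by `setRel()`.

```python
# Hypothetical two-level normalization table (illustrative values only,
# not taken from the actual normalized_section_header_mapper model).
SECTION_MAP = {
    "ADMISSION DIAGNOSIS": {"level_1": "DIAGNOSIS", "level_2": "ADMISSION DIAGNOSIS"},
    "PRINCIPAL DIAGNOSIS": {"level_1": "DIAGNOSIS", "level_2": "PRINCIPAL DIAGNOSIS"},
    "GENERAL REVIEW":      {"level_1": "REVIEW OF SYSTEMS", "level_2": "GENERAL REVIEW"},
}

def normalize_header(header: str, rel: str = "level_1") -> str:
    """Return the normalized form of a section header at the chosen level,
    falling back to the original header when no mapping exists."""
    return SECTION_MAP.get(header.upper(), {}).get(rel, header)

print(normalize_header("Admission Diagnosis"))            # coarse (level_1) mapping
print(normalize_header("General Review", rel="level_2"))  # finer (level_2) mapping
```

In the real pipeline, `setRel("level_1")` or `setRel("level_2")` selects which of the two mappings the `ChunkMapperModel` emits.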
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NORMALIZED_SECTION_HEADER_MAPPER/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NORMALIZED_SECTION_HEADER_MAPPER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/normalized_section_header_mapper_en_3.4.2_3.0_1649098646707.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/normalized_section_header_mapper_en_3.4.2_3.0_1649098646707.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en","clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("word_embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_jsl_slim", "en", "clinical/models")\
.setInputCols(["sentence","token", "word_embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(["Header"])
chunkerMapper = ChunkMapperModel.pretrained("normalized_section_header_mapper", "en", "clinical/models") \
.setInputCols("ner_chunk")\
.setOutputCol("mappings")\
.setRel("level_1")  # or "level_2"
pipeline = Pipeline().setStages([document_assembler,
sentence_detector,
tokenizer,
embeddings,
clinical_ner,
ner_converter,
chunkerMapper])
sentences = """ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma.
PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma.
GENERAL REVIEW Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface."""
test_data = spark.createDataFrame([[sentences]]).toDF("text")
result = pipeline.fit(test_data).transform(test_data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en","clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("word_embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_jsl_slim", "en", "clinical/models")
.setInputCols(Array("sentence","token", "word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("Header"))
val chunkerMapper = ChunkMapperModel.pretrained("normalized_section_header_mapper", "en", "clinical/models")
.setInputCols("ner_chunk")
.setOutputCol("mappings")
.setRel("level_1")  // or "level_2"
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
embeddings,
clinical_ner,
ner_converter,
chunkerMapper))
val test_sentence= """ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma.
PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma.
GENERAL REVIEW Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface."""
val test_data = Seq(test_sentence).toDS.toDF("text")
val result = pipeline.fit(test_data).transform(test_data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.section_headers_normalized").predict("""ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma.
PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma.
GENERAL REVIEW Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.""")
```
## Results
```bash
+-------------------+------------------+
|section |normalized_section|
+-------------------+------------------+
|ADMISSION DIAGNOSIS|DIAGNOSIS |
|PRINCIPAL DIAGNOSIS|DIAGNOSIS |
|GENERAL REVIEW |REVIEW TYPE |
+-------------------+------------------+
```
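Conceptually, the chunk mapper behaves like a two-level dictionary lookup keyed on the detected header chunk. The toy sketch below illustrates the idea using the `level_1` values from the results above; the dictionary contents and the `level_2` values are hypothetical examples, not the model's actual mapping tables.

```python
# Toy illustration of two-level section-header normalization.
# Mappings are illustrative only; the real model ships its own tables.
SECTION_MAP = {
    "ADMISSION DIAGNOSIS": {"level_1": "DIAGNOSIS", "level_2": "ADMISSION DIAGNOSIS"},
    "PRINCIPAL DIAGNOSIS": {"level_1": "DIAGNOSIS", "level_2": "PRINCIPAL DIAGNOSIS"},
    "GENERAL REVIEW": {"level_1": "REVIEW TYPE", "level_2": "GENERAL REVIEW"},
}

def normalize_header(chunk, rel="level_1"):
    """Return the normalized form of a header chunk, analogous to setRel()."""
    return SECTION_MAP.get(chunk.upper().strip(), {}).get(rel, chunk)

print(normalize_header("ADMISSION DIAGNOSIS"))        # DIAGNOSIS
print(normalize_header("GENERAL REVIEW", "level_1"))  # REVIEW TYPE
```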
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|normalized_section_header_mapper|
|Compatibility:|Healthcare NLP 3.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|14.2 KB|
---
layout: model
title: Translate Tuvaluan to English Pipeline
author: John Snow Labs
name: translate_tvl_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, tvl, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `tvl`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_tvl_en_xx_2.7.0_2.4_1609690444126.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_tvl_en_xx_2.7.0_2.4_1609690444126.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_tvl_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_tvl_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.tvl.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_tvl_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Integration Clause Binary Classifier
author: John Snow Labs
name: legclf_integration_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `integration` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level.
If you have big legal documents, and you want to look for clauses, we recommend you to split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (see the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `integration`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_integration_clause_en_1.0.0_3.2_1660122564699.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_integration_clause_en_1.0.0_3.2_1660122564699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
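This card ships without usage code; the minimal sketch below is modeled on the sibling `legclf_*` clause-classifier cards in this collection. The `sent_bert_base_cased` embeddings model is an assumption; per the Model Information table, any Bert sentence-embeddings stage producing `sentence_embeddings` should slot in, with the classifier writing to `category`.

```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
# Assumed embeddings model; the card only specifies the sentence_embeddings input label.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
.setInputCols("document")\
.setOutputCol("sentence_embeddings")
doc_classifier = ClassifierDLModel.pretrained("legclf_integration_clause", "en", "legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")
pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])
df = spark.createDataFrame([["This Agreement constitutes the entire agreement of the parties with respect to its subject matter."]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```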
## Results
```bash
+-------------+
|       result|
+-------------+
|[integration]|
|      [other]|
|      [other]|
|[integration]|
+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_integration_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
integration 0.93 0.73 0.82 37
other 0.92 0.98 0.95 118
accuracy - - 0.92 155
macro-avg 0.93 0.86 0.88 155
weighted-avg 0.92 0.92 0.92 155
```
---
layout: model
title: English asr_Quran_speech_recognizer TFWav2Vec2ForCTC from Nuwaisir
author: John Snow Labs
name: asr_Quran_speech_recognizer
date: 2022-09-26
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Quran_speech_recognizer` is an English model originally trained by Nuwaisir.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_Quran_speech_recognizer_gpu.
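Under the hood, Wav2Vec2ForCTC emits per-frame token predictions that are decoded with CTC: take the best token per frame, collapse consecutive repeats, then drop blank symbols. A minimal sketch of that greedy decoding step (illustrative only, not Spark NLP's internal implementation; `_` stands in for the model's reserved blank token):

```python
from itertools import groupby

BLANK = "_"  # placeholder for the CTC blank symbol

def ctc_greedy_decode(frame_tokens):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    collapsed = [tok for tok, _ in groupby(frame_tokens)]  # merge adjacent repeats
    return "".join(tok for tok in collapsed if tok != BLANK)

# "hheel__lloo" -> collapse repeats -> "hel_lo" -> drop blanks -> "hello"
print(ctc_greedy_decode(list("hheel__lloo")))  # hello
```

Note how the blank between the two `l` groups is what lets CTC emit a genuine double letter.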
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Quran_speech_recognizer_en_4.2.0_3.0_1664208158710.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Quran_speech_recognizer_en_4.2.0_3.0_1664208158710.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_Quran_speech_recognizer", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_Quran_speech_recognizer", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_Quran_speech_recognizer|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Detect Clinical Conditions (ner_eu_clinical_condition)
author: John Snow Labs
name: ner_eu_clinical_condition
date: 2023-02-06
tags: [en, clinical, licensed, ner]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.2.8
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition (NER) deep learning model for clinical conditions. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
## Predicted Entities
`clinical_condition`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_en_4.2.8_3.0_1675718793293.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_en_4.2.8_3.0_1675718793293.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained('ner_eu_clinical_condition', "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["""Hyperparathyroidism was considered upon the fourth occasion. The history of weakness and generalized joint pains were present. He also had history of epigastric pain diagnosed informally as gastritis. He had previously had open reduction and internal fixation for the initial two fractures under general anesthesia. He sustained mandibular fracture."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_eu_clinical_condition", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter))
val data = Seq("""Hyperparathyroidism was considered upon the fourth occasion. The history of weakness and generalized joint pains were present. He also had history of epigastric pain diagnosed informally as gastritis. He had previously had open reduction and internal fixation for the initial two fractures under general anesthesia. He sustained mandibular fracture.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+-----------------------+------------------+
|chunk |ner_label |
+-----------------------+------------------+
|Hyperparathyroidism |clinical_condition|
|weakness |clinical_condition|
|generalized joint pains|clinical_condition|
|epigastric pain |clinical_condition|
|gastritis |clinical_condition|
|fractures |clinical_condition|
|anesthesia |clinical_condition|
|mandibular fracture |clinical_condition|
+-----------------------+------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_eu_clinical_condition|
|Compatibility:|Healthcare NLP 4.2.8+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|851.3 KB|
## References
The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
## Benchmarking
```bash
label tp fp fn total precision recall f1
clinical_event 230.0 28.0 70.0 300.0 0.8915 0.7667 0.8244
macro - - - - - - 0.8244
micro - - - - - - 0.8244
```
---
layout: model
title: English RobertaForQuestionAnswering (from mvonwyl)
author: John Snow Labs
name: roberta_qa_roberta_base_finetuned_squad2
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad2` is an English model originally trained by `mvonwyl`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_squad2_en_4.0.0_3.0_1655734553755.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_squad2_en_4.0.0_3.0_1655734553755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_finetuned_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.base.by_mvonwyl").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_finetuned_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/mvonwyl/roberta-base-finetuned-squad2
---
layout: model
title: English RobertaForQuestionAnswering Large Cased model (from susghosh)
author: John Snow Labs
name: roberta_qa_large_squad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-squad` is an English model originally trained by `susghosh`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_squad_en_4.3.0_3.0_1674221913718.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_squad_en_4.3.0_3.0_1674221913718.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_large_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/susghosh/roberta-large-squad
---
layout: model
title: Translate Bemba (Zambia) to English Pipeline
author: John Snow Labs
name: translate_bem_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, bem, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `bem`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_bem_en_xx_2.7.0_2.4_1609701800337.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_bem_en_xx_2.7.0_2.4_1609701800337.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_bem_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_bem_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.bem.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_bem_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Pipeline to Resolve CVX Codes
author: John Snow Labs
name: cvx_resolver_pipeline
date: 2022-10-12
tags: [en, licensed, clinical, resolver, chunk_mapping, cvx, pipeline]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 4.2.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline maps entities to their corresponding CVX codes. Simply feed in your text and it will return the matching CVX codes.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/cvx_resolver_pipeline_en_4.2.1_3.0_1665611325640.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/cvx_resolver_pipeline_en_4.2.1_3.0_1665611325640.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
resolver_pipeline = PretrainedPipeline("cvx_resolver_pipeline", "en", "clinical/models")
text= "The patient has a history of influenza vaccine, tetanus and DTaP"
result = resolver_pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val resolver_pipeline = new PretrainedPipeline("cvx_resolver_pipeline", "en", "clinical/models")
val result = resolver_pipeline.fullAnnotate("The patient has a history of influenza vaccine, tetanus and DTaP")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.cvx_pipeline").predict("""The patient has a history of influenza vaccine, tetanus and DTaP""")
```
## Results
```bash
+-----------------+---------+--------+
|chunk |ner_chunk|cvx_code|
+-----------------+---------+--------+
|influenza vaccine|Vaccine |160 |
|tetanus |Vaccine |35 |
|DTaP |Vaccine |20 |
+-----------------+---------+--------+
```
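Conceptually, the chunk-mapping stage at the heart of this pipeline is a lookup from the recognized vaccine chunk to its CVX code. The toy sketch below uses the codes from the results table above; it is illustrative only, since the real pipeline combines a chunk mapper with a sentence-entity resolver fallback (see Included Models).

```python
# Tiny illustration of chunk -> CVX code mapping; codes taken from the
# results table above, not an exhaustive CVX code set.
CVX_CODES = {
    "influenza vaccine": "160",
    "tetanus": "35",
    "dtap": "20",
}

def map_to_cvx(chunk):
    """Case-insensitive lookup of a vaccine chunk's CVX code."""
    return CVX_CODES.get(chunk.lower().strip())

for chunk in ["influenza vaccine", "tetanus", "DTaP"]:
    print(chunk, "->", map_to_cvx(chunk))
```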
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|cvx_resolver_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.2.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|2.1 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
- ChunkMapperModel
- ChunkMapperFilterer
- Chunk2Doc
- BertSentenceEmbeddings
- SentenceEntityResolverModel
- ResolverMerger
---
layout: model
title: Legal Electrical And Nuclear Industries Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_electrical_and_nuclear_industries_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, electrical_and_nuclear_industries, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the `legclf_electrical_and_nuclear_industries_bert` model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the Electrical_and_Nuclear_Industries class or not (Binary Classification) according to EuroVoc labels.
## Predicted Entities
`Electrical_and_Nuclear_Industries`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_electrical_and_nuclear_industries_bert_en_1.0.0_3.0_1678111896903.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_electrical_and_nuclear_industries_bert_en_1.0.0_3.0_1678111896903.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
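This card ships without usage code; the minimal sketch below is modeled on the sibling EURLEX `legclf_*` cards in this collection. The `sent_bert_base_cased` embeddings model is an assumption; per the Model Information table, the classifier consumes `sentence_embeddings` and writes its prediction to `class`.

```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
# Assumed embeddings model; the card only specifies the sentence_embeddings input label.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
.setInputCols("document")\
.setOutputCol("sentence_embeddings")
doc_classifier = ClassifierDLModel.pretrained("legclf_electrical_and_nuclear_industries_bert", "en", "legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("class")
pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])
df = spark.createDataFrame([["Council Directive on the internal market in electricity and the safety of nuclear installations."]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```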
## Results
```bash
+-----------------------------------+
|                             result|
+-----------------------------------+
|[Electrical_and_Nuclear_Industries]|
|                            [Other]|
|                            [Other]|
|[Electrical_and_Nuclear_Industries]|
+-----------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_electrical_and_nuclear_industries_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.6 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Electrical_and_Nuclear_Industries 0.82 0.94 0.88 34
Other 0.95 0.85 0.90 46
accuracy - - 0.89 80
macro-avg 0.89 0.89 0.89 80
weighted-avg 0.90 0.89 0.89 80
```
---
layout: model
title: Spanish RobertaForQuestionAnswering (from mrm8488)
author: John Snow Labs
name: roberta_qa_longformer_base_4096_spanish_finetuned_squad
date: 2022-06-20
tags: [es, open_source, question_answering, roberta]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `longformer-base-4096-spanish-finetuned-squad` is a Spanish model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_longformer_base_4096_spanish_finetuned_squad_es_4.0.0_3.0_1655728985385.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_longformer_base_4096_spanish_finetuned_squad_es_4.0.0_3.0_1655728985385.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_longformer_base_4096_spanish_finetuned_squad","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_longformer_base_4096_spanish_finetuned_squad","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.squad.roberta.base_4096.by_mrm8488").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_longformer_base_4096_spanish_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|473.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/mrm8488/longformer-base-4096-spanish-finetuned-squad
- https://creativecommons.org/licenses/by/4.0/legalcode
- https://es.wikinews.org/
- https://creativecommons.org/licenses/by/2.5/
- https://es.wikipedia.org/
- https://creativecommons.org/licenses/by-sa/3.0/legalcode
- https://twitter.com/mrm8488
- https://www.narrativa.com/
- http://clic.ub.edu/corpus/en
---
layout: model
title: Legal Use of proceeds Clause Binary Classifier
author: John Snow Labs
name: legclf_use_of_proceeds_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `use-of-proceeds` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
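The aggregation described above can be sketched without any Spark NLP dependency; the clause types and classifier outputs below are purely hypothetical, standing in for the `category` output of each `legclf_*` model:

```python
# Hypothetical outputs of several binary clause classifiers run on one document.
# In a real pipeline each value would come from a legclf_* model's "category" output column.
clause_predictions = {
    "use-of-proceeds": ["use-of-proceeds"],
    "force-majeure": ["other"],
    "governing-law": ["other"],
}

# Collapse each model's label into a True/False flag per clause type.
flags = {clause: labels[0] == clause for clause, labels in clause_predictions.items()}
print(flags)  # {'use-of-proceeds': True, 'force-majeure': False, 'governing-law': False}
```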
## Predicted Entities
`other`, `use-of-proceeds`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_use_of_proceeds_clause_en_1.0.0_3.2_1660123170841.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_use_of_proceeds_clause_en_1.0.0_3.2_1660123170841.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
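The usage snippet is missing from this card. Below is a minimal, untested sketch assuming the same Legal NLP document-classification pipeline used by the other `legclf_*` models in Models Hub; the `sent_bert_base_cased` embeddings stage and the sample clause text are assumptions, not part of the original card:

```python
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("clause_text") \
.setOutputCol("document")
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
docClassifier = legal.ClassifierDLModel.pretrained("legclf_use_of_proceeds_clause", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("category")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
embeddings,
docClassifier])
df = spark.createDataFrame([["The proceeds of the loan shall be used solely for general corporate purposes."]]).toDF("clause_text")
result = nlpPipeline.fit(df).transform(df)
```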
## Results
```bash
+-----------------+
|           result|
+-----------------+
|[use-of-proceeds]|
|          [other]|
|          [other]|
|[use-of-proceeds]|
+-----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_use_of_proceeds_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.99 1.00 1.00 112
use-of-proceeds 1.00 0.98 0.99 43
accuracy - - 0.99 155
macro-avg 1.00 0.99 0.99 155
weighted-avg 0.99 0.99 0.99 155
```
---
layout: model
title: Word2Vec Embeddings in Macedonian (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, mk, open_source]
task: Embeddings
language: mk
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mk_3.4.1_3.0_1647443888318.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mk_3.4.1_3.0_1647443888318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mk") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Сакам искра НЛП"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mk")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Сакам искра НЛП").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("mk.embed.w2v_cc_300d").predict("""Сакам искра НЛП""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|mk|
|Size:|788.2 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Translate English to Morisyen Pipeline
author: John Snow Labs
name: translate_en_mfe
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, mfe, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `mfe`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_mfe_xx_2.7.0_2.4_1609690383378.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_mfe_xx_2.7.0_2.4_1609690383378.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_mfe", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_mfe", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.mfe').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_mfe|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Extract Cancer Therapies and Posology Information
author: John Snow Labs
name: ner_oncology_unspecific_posology
date: 2022-11-24
tags: [licensed, clinical, oncology, en, ner, treatment, posology]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts mentions of treatments and posology information using unspecific labels (low granularity).
Definitions of Predicted Entities:
- `Cancer_Therapy`: Mentions of cancer treatments, including chemotherapy, radiotherapy, surgery and other.
- `Posology_Information`: Terms related to the posology of the treatment, including duration, frequencies and dosage.
## Predicted Entities
`Cancer_Therapy`, `Posology_Information`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_en_4.2.2_3.0_1669309081671.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_en_4.2.2_3.0_1669309081671.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_unspecific_posology", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_unspecific_posology", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_unspecific_posology").predict("""The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition.""")
```
## Results
```bash
| chunk | ner_label |
|:-----------------|:---------------------|
| adriamycin | Cancer_Therapy |
| 60 mg/m2 | Posology_Information |
| cyclophosphamide | Cancer_Therapy |
| 600 mg/m2 | Posology_Information |
| over six courses | Posology_Information |
| second cycle | Posology_Information |
| chemotherapy | Cancer_Therapy |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_unspecific_posology|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|34.3 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Posology_Information 2663 244 399 3062 0.92 0.87 0.89
Cancer_Therapy 2580 317 247 2827 0.89 0.91 0.90
macro_avg 5243 561 646 5889 0.90 0.89 0.90
micro_avg 5243 561 646 5889 0.90 0.89 0.90
```
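The micro-averaged row in the table above follows directly from the pooled true-positive, false-positive and false-negative counts; a quick check in plain Python:

```python
# Pooled counts from the per-label rows above.
tp = 2663 + 2580  # Posology_Information + Cancer_Therapy
fp = 244 + 317
fn = 399 + 247

precision = tp / (tp + fp)  # 5243 / 5804
recall = tp / (tp + fn)     # 5243 / 5889
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.9 0.89 0.9
```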
---
layout: model
title: Generic Classifier for Adverse Drug Events (LogReg)
author: John Snow Labs
name: generic_logreg_classifier_ade
date: 2023-05-09
tags: [generic_classifier, logreg, clinical, licensed, en, text_classification, ade]
task: Text Classification
language: en
edition: Healthcare NLP 4.4.1
spark_version: 3.0
supported: true
annotator: GenericLogRegClassifierModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained with the Generic Classifier annotator and the Logistic Regression algorithm, and classifies text/sentences into two categories:
- `True`: the sentence mentions a possible ADE.
- `False`: the sentence doesn't contain any information about an ADE.
The corpus used for model training is the ADE-Corpus-V2 Dataset (Adverse Drug Reaction Data), a dataset for classifying whether a sentence is ADE-related (True) or not (False).
## Predicted Entities
`True`, `False`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/generic_logreg_classifier_ade_en_4.4.1_3.0_1683641152188.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/generic_logreg_classifier_ade_en_4.4.1_3.0_1683641152188.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("word_embeddings")
sentence_embeddings = SentenceEmbeddings() \
.setInputCols(["document", "word_embeddings"]) \
.setOutputCol("sentence_embeddings") \
.setPoolingStrategy("AVERAGE")
features_asm = FeaturesAssembler()\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("features")
generic_classifier = GenericClassifierModel.pretrained("generic_logreg_classifier_ade", "en", "clinical/models")\
.setInputCols(["features"])\
.setOutputCol("class")
clf_Pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
word_embeddings,
sentence_embeddings,
features_asm,
generic_classifier])
data = spark.createDataFrame([["""None of the patients required treatment for the overdose."""], ["""Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient."""]]).toDF("text")
result = clf_Pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("word_embeddings")
val sentence_embeddings = new SentenceEmbeddings()
.setInputCols(Array("document", "word_embeddings"))
.setOutputCol("sentence_embeddings")
.setPoolingStrategy("AVERAGE")
val features_asm = new FeaturesAssembler()
.setInputCols("sentence_embeddings")
.setOutputCol("features")
val generic_classifier = GenericClassifierModel.pretrained("generic_logreg_classifier_ade", "en", "clinical/models")
.setInputCols("features")
.setOutputCol("class")
val clf_Pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, word_embeddings, sentence_embeddings, features_asm, generic_classifier))
val data = Seq("None of the patients required treatment for the overdose.", "Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.").toDS.toDF("text")
val result = clf_Pipeline.fit(data).transform(data)
```
## Results
```bash
+----------------------------------------------------------------------------------------+-------+
|text |result |
+----------------------------------------------------------------------------------------+-------+
|None of the patients required treatment for the overdose. |[False]|
|Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.|[True] |
+----------------------------------------------------------------------------------------+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|generic_logreg_classifier_ade|
|Compatibility:|Healthcare NLP 4.4.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[feature_vector]|
|Output Labels:|[prediction]|
|Language:|en|
|Size:|17.0 KB|
## References
The corpus used for model training is the ADE-Corpus-V2 Dataset (Adverse Drug Reaction Data), a dataset for classifying whether a sentence is ADE-related (True) or not (False).
Reference: Gurulingappa et al., Benchmark Corpus to Support Information Extraction for Adverse Drug Effects, JBI, 2012. http://www.sciencedirect.com/science/article/pii/S1532046412000615
## Benchmarking
```bash
label precision recall f1-score support
False 0.84 0.92 0.88 3362
True 0.74 0.57 0.64 1361
accuracy - - 0.82 4723
macro avg 0.79 0.74 0.76 4723
weighted avg 0.81 0.82 0.81 4723
```
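The weighted-average row in the table above can be reproduced from the per-class scores and supports:

```python
# Per-class scores and supports from the table above.
classes = {
    "False": {"precision": 0.84, "recall": 0.92, "f1": 0.88, "support": 3362},
    "True":  {"precision": 0.74, "recall": 0.57, "f1": 0.64, "support": 1361},
}
total = sum(c["support"] for c in classes.values())  # 4723

def weighted(metric):
    # Support-weighted average of a per-class metric.
    return sum(c[metric] * c["support"] for c in classes.values()) / total

print(round(weighted("precision"), 2),  # 0.81
      round(weighted("recall"), 2),     # 0.82
      round(weighted("f1"), 2))         # 0.81
```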
---
layout: model
title: Fast Neural Machine Translation Model from English to Altaic Languages
author: John Snow Labs
name: opus_mt_en_tut
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, tut, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `en`
- target languages: `tut`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_tut_xx_2.7.0_2.4_1609166499411.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_tut_xx_2.7.0_2.4_1609166499411.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_tut", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_tut", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.tut').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_tut|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English asr_processor_with_lm TFWav2Vec2ForCTC from hf-internal-testing
author: John Snow Labs
name: pipeline_asr_processor_with_lm
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_processor_with_lm` is an English model originally trained by hf-internal-testing.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_processor_with_lm_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_processor_with_lm_en_4.2.0_3.0_1664025190161.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_processor_with_lm_en_4.2.0_3.0_1664025190161.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('pipeline_asr_processor_with_lm', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("pipeline_asr_processor_with_lm", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_processor_with_lm|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|459.1 KB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Legal Force Majeure Clause Binary Classifier (CUAD dataset)
author: John Snow Labs
name: legclf_cuad_force_majeure_clause
date: 2022-11-30
tags: [en, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `force-majeure` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
There are other models in Models Hub with a similar title; the difference is the dataset they were trained on. This one was trained on the `cuad` dataset.
## Predicted Entities
`other`, `force-majeure`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_force_majeure_clause_en_1.0.0_3.0_1669806586316.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_force_majeure_clause_en_1.0.0_3.0_1669806586316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("clause_text") \
.setOutputCol("document")
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
docClassifier = legal.ClassifierDLModel.pretrained("legclf_cuad_force_majeure_clause", "en", "legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
embeddings,
docClassifier])
df = spark.createDataFrame([["10 . FORCE-MAJEURE 10.1 Except for the obligations to make any payment , required by this Contract ( which shall not be subject to relief under this item ), a Party shall not be in breach of this Contract and liable to the other Party for any failure to fulfil any obligation under this Contract to the extent any fulfillment has been interfered with , hindered , delayed , or prevented by any circumstance whatsoever , which is not reasonably within the control of and is unforeseeable by such Party and if such Party exercised due diligence , including acts of God , fire , flood , freezing , landslides , lightning , earthquakes , fire , storm , floods , washouts , and other natural disasters , wars ( declared or undeclared ), insurrections , riots , civil disturbances , epidemics , quarantine restrictions , blockade , embargo , strike , lockouts , labor disputes , or restrictions imposed by any government ."]]).toDF("clause_text")
model = nlpPipeline.fit(df)
result = model.transform(df)
```
## Results
```bash
+---------------+
|         result|
+---------------+
|[force-majeure]|
+---------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_cuad_force_majeure_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.3 MB|
## References
In-house annotations on Cuad dataset.
## Benchmarking
```bash
label precision recall f1-score support
force-majeure 0.97 0.94 0.95 31
other 0.96 0.98 0.97 56
accuracy - - 0.97 87
macro-avg 0.97 0.96 0.96 87
weighted-avg 0.97 0.97 0.97 87
```
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_10
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-16-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_10_en_4.0.0_3.0_1657184453535.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_10_en_4.0.0_3.0_1657184453535.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_10","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_10","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_10|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-16-finetuned-squad-seed-10
---
layout: model
title: Pipeline to Mapping ICD10-CM Codes with Their Corresponding SNOMED Codes
author: John Snow Labs
name: icd10cm_snomed_mapping
date: 2022-06-27
tags: [icd10cm, snomed, pipeline, clinical, en, licensed, chunk_mapper]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the `icd10cm_snomed_mapper` model.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_snomed_mapping_en_3.5.3_3.0_1656361159581.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_snomed_mapping_en_3.5.3_3.0_1656361159581.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("icd10cm_snomed_mapping", "en", "clinical/models")
result = pipeline.fullAnnotate('R079 N4289 M62830')
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("icd10cm_snomed_mapping", "en", "clinical/models")
val result = pipeline.fullAnnotate("R079 N4289 M62830")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.icd10cm_to_snomed.pipe").predict("""R079 N4289 M62830""")
```
## Results
```bash
| icd10cm_code | snomed_code       |
|:-------------|:------------------|
| R079         | 161972006         |
| N4289        | 22035000          |
| M62830       | 16410651000119105 |
```
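Conceptually, the chunk mapper behaves like a code-to-code lookup. A toy, dependency-free sketch using only the example mappings from the table above (the dict is illustrative, not the model's internal format):

```python
# Toy ICD10-CM -> SNOMED lookup mirroring the example output above.
# The real ChunkMapperModel covers far more codes; this dict is illustrative only.
icd10cm_to_snomed = {
    "R079": "161972006",
    "N4289": "22035000",
    "M62830": "16410651000119105",
}

def map_codes(text):
    """Map whitespace-separated ICD10-CM codes to their SNOMED codes."""
    return [icd10cm_to_snomed.get(code, "NONE") for code in text.split()]

print(map_codes("R079 N4289 M62830"))
# ['161972006', '22035000', '16410651000119105']
```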
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|icd10cm_snomed_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.1 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- ChunkMapperModel
---
layout: model
title: Legal Waiver of jury trial Clause Binary Classifier
author: John Snow Labs
name: legclf_waiver_of_jury_trial_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `waiver-of-jury-trial` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.
If you have big legal documents, and you want to look for clauses, we recommend you to split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
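For instance, paragraph splitting by multiline can be sketched in a few lines of plain Python (a simplified illustration, not the workshop's implementation):

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on blank lines (multiline splitting)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "First clause text.\n\nSecond clause text.\n\n\nThird clause text."
print(split_paragraphs(doc))
# ['First clause text.', 'Second clause text.', 'Third clause text.']
```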
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `waiver-of-jury-trial`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_waiver_of_jury_trial_clause_en_1.0.0_3.2_1660123193810.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_waiver_of_jury_trial_clause_en_1.0.0_3.2_1660123193810.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
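{% include programmingLanguageSelectScalaPythonNLU.html %}
A minimal usage sketch, following the pattern of other `legclf_*` clause classifiers; the sentence-embeddings stage (`sent_bert_base_cased`) is an assumption, so check the Models Hub card for the exact pipeline:

```python
# Hypothetical pipeline sketch -- the embeddings model name is an assumption.
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_waiver_of_jury_trial_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR LEGAL TEXT HERE"]]).toDF("text")
result = nlpPipeline.fit(df).transform(df)
```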
## Results
```bash
+----------------------+
|result                |
+----------------------+
|[waiver-of-jury-trial]|
|[other]               |
|[other]               |
|[waiver-of-jury-trial]|
+----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_waiver_of_jury_trial_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.96 0.99 0.98 102
waiver-of-jury-trial 0.96 0.86 0.91 28
accuracy - - 0.96 130
macro-avg 0.96 0.92 0.94 130
weighted-avg 0.96 0.96 0.96 130
```
---
layout: model
title: Modern Greek (1453-) BertForQuestionAnswering model (from Danastos)
author: John Snow Labs
name: bert_qa_newsqa_bert_el_Danastos
date: 2022-06-03
tags: [open_source, question_answering, bert]
task: Question Answering
language: el
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `newsqa_bert_el` is a Modern Greek (1453-) model originally trained by `Danastos`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_newsqa_bert_el_Danastos_el_4.0.0_3.0_1654249941385.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_newsqa_bert_el_Danastos_el_4.0.0_3.0_1654249941385.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_newsqa_bert_el_Danastos","el") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_newsqa_bert_el_Danastos","el")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("el.answer_question.news.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_newsqa_bert_el_Danastos|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|el|
|Size:|421.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Danastos/newsqa_bert_el
---
layout: model
title: English image_classifier_vit_rust_image_classification_3 ViTForImageClassification from SummerChiam
author: John Snow Labs
name: image_classifier_vit_rust_image_classification_3
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rust_image_classification_3` is an English model originally trained by SummerChiam.
## Predicted Entities
`nonrust`, `rust`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_3_en_4.1.0_3.0_1660167270260.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_3_en_4.1.0_3.0_1660167270260.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_rust_image_classification_3", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_rust_image_classification_3", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_rust_image_classification_3|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Fast Neural Machine Translation Model from English to Lozi
author: John Snow Labs
name: opus_mt_en_loz
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, loz, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `loz`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_loz_xx_2.7.0_2.4_1609164276767.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_loz_xx_2.7.0_2.4_1609164276767.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_loz", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_loz", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.loz').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_loz|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_32_finetuned_squad_seed_0
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-32-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_32_finetuned_squad_seed_0_en_4.3.0_3.0_1674215168842.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_32_finetuned_squad_seed_0_en_4.3.0_3.0_1674215168842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
questionAnswering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_32_finetuned_squad_seed_0","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, questionAnswering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val questionAnswering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_32_finetuned_squad_seed_0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, questionAnswering))
val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_32_finetuned_squad_seed_0|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|417.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-32-finetuned-squad-seed-0
---
layout: model
title: English RobertaForQuestionAnswering Mini Cased model (from sguskin)
author: John Snow Labs
name: roberta_qa_minilmv2_l6_h384_squad1.1
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `minilmv2-L6-H384-squad1.1` is an English model originally trained by `sguskin`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_minilmv2_l6_h384_squad1.1_en_4.3.0_3.0_1674211435898.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_minilmv2_l6_h384_squad1.1_en_4.3.0_3.0_1674211435898.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
questionAnswering = RoBertaForQuestionAnswering.pretrained("roberta_qa_minilmv2_l6_h384_squad1.1","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, questionAnswering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val questionAnswering = RoBertaForQuestionAnswering.pretrained("roberta_qa_minilmv2_l6_h384_squad1.1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, questionAnswering))
val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_minilmv2_l6_h384_squad1.1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|112.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/sguskin/minilmv2-L6-H384-squad1.1
---
layout: model
title: French CamemBert Embeddings (from JonathanSum)
author: John Snow Labs
name: camembert_embeddings_JonathanSum_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `JonathanSum`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_JonathanSum_generic_model_fr_3.4.4_3.0_1653986427511.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_JonathanSum_generic_model_fr_3.4.4_3.0_1653986427511.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_JonathanSum_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_JonathanSum_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_JonathanSum_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/JonathanSum/dummy-model
---
layout: model
title: Stopwords Remover for Vietnamese language (1942 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, vi, open_source]
task: Stop Words Removal
language: vi
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_vi_3.4.1_3.0_1646672286429.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_vi_3.4.1_3.0_1646672286429.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","vi") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Bạn không tốt hơn tôi"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","vi")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("Bạn không tốt hơn tôi").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("vi.stopwords").predict("""Bạn không tốt hơn tôi""")
```
## Results
```bash
+------+
|result|
+------+
|[] |
+------+
```
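Conceptually, stopword removal filters tokens against a lookup set; every token in the example sentence is a stopword, which is why the result above is empty. A toy sketch with a hypothetical five-word sample of the 1942-entry list:

```python
# Hypothetical five-entry sample of the 1942-entry Vietnamese stopword list.
stopwords = {"bạn", "không", "tốt", "hơn", "tôi"}

def clean_tokens(tokens):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

print(clean_tokens("Bạn không tốt hơn tôi".split()))
# []
```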
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|vi|
|Size:|8.6 KB|
---
layout: model
title: Generic Deidentification NER
author: John Snow Labs
name: legner_deid
date: 2022-08-09
tags: [en, legal, ner, deid, licensed]
task: [De-identification, Named Entity Recognition]
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
recommended: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a NER model that detects generic entities which may need to be masked or obfuscated to comply with regulations such as GDPR and CCPA. This is just the NER model; make sure you also try the full De-identification pipelines available in Models Hub.
## Predicted Entities
`AGE`, `CITY`, `COUNTRY`, `DATE`, `EMAIL`, `FAX`, `LOCATION-OTHER`, `ORG`, `PERSON`, `PHONE`, `PROFESSION`, `STATE`, `STREET`, `URL`, `ZIP`
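Conceptually, once the NER model has detected the entity chunks, masking replaces each one with its label. A toy, dependency-free sketch (hypothetical helper, not the licensed De-identification pipeline):

```python
def mask_entities(text, chunks):
    """Replace each detected entity chunk with a <LABEL> placeholder."""
    for chunk, label in chunks:
        text = text.replace(chunk, f"<{label}>")
    return text

# Entity chunks as a De-identification NER model might return them (illustrative).
ner_chunks = [("Bioeq IP AG", "ORG"), ("Redwood City", "CITY"), ("Nov. 02, 2019", "DATE")]
print(mask_entities("Bioeq IP AG, Redwood City, effective as of Nov. 02, 2019.", ner_chunks))
# <ORG>, <CITY>, effective as of <DATE>.
```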
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/DEID_LEGAL/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_deid_en_1.0.0_3.2_1660050699764.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_deid_en_1.0.0_3.2_1660050699764.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = legal.NerModel.pretrained('legner_deid', "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""
This LICENSE AND DEVELOPMENT AGREEMENT (this Agreement) is entered into effective as of Nov. 02, 2019 (the Effective Date) by and between Bioeq IP AG, having its principal place of business at 333 Twin Dolphin Drive, Suite 600, Redwood City, CA, 94065, USA (Licensee).
"""]
res = model.transform(spark.createDataFrame([text]).toDF("text"))
```
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mentions_bert", "de", "clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
data = spark.createDataFrame([
["Die Temperaturen klettern am Wochenende."],
["Zu den Symptomen gehört u.a. eine verringerte Greifkraft."]
]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mentions_bert", "de", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier))
val data = Seq("Die Temperaturen klettern am Wochenende.",
"Zu den Symptomen gehört u.a. eine verringerte Greifkraft.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.classify.bert_sequence.health_mentions_bert").predict("""Zu den Symptomen gehört u.a. eine verringerte Greifkraft.""")
```
## Results
```bash
+---------------------------------------------------------+----------------+
|text |result |
+---------------------------------------------------------+----------------+
|Die Temperaturen klettern am Wochenende. |[non-health] |
|Zu den Symptomen gehört u.a. eine verringerte Greifkraft.|[health-related]|
+---------------------------------------------------------+----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_health_mentions_bert|
|Compatibility:|Healthcare NLP 4.0.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|de|
|Size:|409.8 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
Curated from several academic and in-house datasets.
## Benchmarking
```bash
label precision recall f1-score support
non-health 0.99 0.90 0.94 82
health-related 0.89 0.99 0.94 69
accuracy - - 0.94 151
macro-avg 0.94 0.94 0.94 151
weighted-avg 0.94 0.94 0.94 151
```
---
layout: model
title: Word2Vec Embeddings in Sakha (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, sah, open_source]
task: Embeddings
language: sah
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
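Conceptually, the annotator is a token-to-vector lookup table. A toy sketch with 3-dimensional vectors (the real model maps to 300 dimensions, as noted below; the vectors here are made up):

```python
# Toy token-to-vector table with 3-d vectors; the real model uses 300-d vectors.
embeddings = {
    "i": [0.1, 0.2, 0.3],
    "love": [0.4, 0.5, 0.6],
}
UNK = [0.0, 0.0, 0.0]  # fallback for out-of-vocabulary tokens

def embed(tokens):
    """Look up a vector for each token, case-insensitively."""
    return [embeddings.get(t.lower(), UNK) for t in tokens]

print(embed(["I", "love", "Spark"]))
# [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.0, 0.0, 0.0]]
```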
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sah_3.4.1_3.0_1647455186697.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sah_3.4.1_3.0_1647455186697.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sah") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sah")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("sah.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|sah|
|Size:|150.9 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Legal Financial Institutions And Credit Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_financial_institutions_and_credit_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, financial_institutions_and_credit, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the `legclf_financial_institutions_and_credit_bert` model, a Bert Sentence Embeddings Document Classifier, determines whether the document belongs to the class `Financial_Institutions_and_Credit` or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`Financial_Institutions_and_Credit`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_financial_institutions_and_credit_bert_en_1.0.0_3.0_1678111900978.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_financial_institutions_and_credit_bert_en_1.0.0_3.0_1678111900978.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
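This card is missing its usage snippet; the sketch below follows the pattern used by the other Legal NLP classifier cards in this collection. The `sent_bert_base_cased` embeddings model is an assumption — verify the exact embeddings this classifier was trained with before relying on it.

```python
# Sketch only: the embeddings model name below is assumed, not confirmed by this card.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_financial_institutions_and_credit_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```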
## Results
```bash
+-------+
|result|
+-------+
|[Financial_Institutions_and_Credit]|
|[Other]|
|[Other]|
|[Financial_Institutions_and_Credit]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_financial_institutions_and_credit_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.4 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Financial_Institutions_and_Credit 0.87 0.89 0.88 81
Other 0.87 0.85 0.86 72
accuracy - - 0.87 153
macro-avg 0.87 0.87 0.87 153
weighted-avg 0.87 0.87 0.87 153
```
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from Nakul24)
author: John Snow Labs
name: roberta_qa_emotion_extraction
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RoBERTa-emotion-extraction` is an English model originally trained by `Nakul24`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_emotion_extraction_en_4.3.0_3.0_1674208609562.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_emotion_extraction_en_4.3.0_3.0_1674208609562.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_emotion_extraction","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_emotion_extraction","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_emotion_extraction|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|426.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Nakul24/RoBERTa-emotion-extraction
---
layout: model
title: Legal Question Answering (Bert)
author: John Snow Labs
name: legqa_bert
date: 2022-08-09
tags: [en, legal, qa, licensed]
task: Question Answering
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Legal Bert-based Question Answering model, trained on squad-v2, finetuned on proprietary Legal questions and answers.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legqa_bert_en_1.0.0_3.2_1660054695560.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legqa_bert_en_1.0.0_3.2_1660054695560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
spanClassifier = nlp.BertForQuestionAnswering.pretrained("legqa_bert","en", "legal/models") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = nlp.Pipeline().setStages([
documentAssembler,
spanClassifier
])
example = spark.createDataFrame([["Who was subjected to torture?", "The applicant submitted that her husband was subjected to treatment amounting to abuse whilst in the custody of police."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
result.select('answer.result').show()
```
## Results
```bash
`her husband`
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legqa_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
Trained on squad-v2, finetuned on proprietary Legal questions and answers.
---
layout: model
title: Finnish asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP TFWav2Vec2ForCTC from Finnish-NLP
author: John Snow Labs
name: pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP
date: 2022-09-24
tags: [wav2vec2, fi, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP` is a Finnish model originally trained by Finnish-NLP.
NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP_fi_4.2.0_3.0_1664025690267.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP_fi_4.2.0_3.0_1664025690267.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP', lang = 'fi')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP", lang = "fi")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|fi|
|Size:|3.5 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Translate English to Rundi Pipeline
author: John Snow Labs
name: translate_en_run
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, run, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `run`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_run_xx_2.7.0_2.4_1609688818752.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_run_xx_2.7.0_2.4_1609688818752.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_run", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_run", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.run').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_run|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Translate English to Papiamento Pipeline
author: John Snow Labs
name: translate_en_pap
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, pap, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `pap`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_pap_xx_2.7.0_2.4_1609691878350.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_pap_xx_2.7.0_2.4_1609691878350.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_pap", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_pap", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.pap').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_pap|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Stockholder Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_stockholder_agreement_bert
date: 2022-11-25
tags: [en, legal, classification, agreement, stockholder, licensed, bert]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_stockholder_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to determine whether a document belongs to the class `stockholder-agreement` or not (binary classification).
Compared to the Longformer model, this model is lighter and faster at inference.
## Predicted Entities
`stockholder-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_stockholder_agreement_bert_en_1.0.0_3.0_1669371874035.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_stockholder_agreement_bert_en_1.0.0_3.0_1669371874035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
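This card is missing its usage snippet; the sketch below follows the pattern used by the other Legal NLP classifier cards in this collection. The `sent_bert_base_cased` embeddings model is an assumption — verify the exact embeddings this classifier was trained with before relying on it.

```python
# Sketch only: the embeddings model name below is assumed, not confirmed by this card.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_stockholder_agreement_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```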
## Results
```bash
+-------+
|result|
+-------+
|[stockholder-agreement]|
|[other]|
|[other]|
|[stockholder-agreement]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_stockholder_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house + SEC documents
## Benchmarking
```bash
label precision recall f1-score support
other 0.93 1.00 0.96 65
stockholder-agreement 1.00 0.83 0.91 29
accuracy - - 0.95 94
macro-avg 0.96 0.91 0.93 94
weighted-avg 0.95 0.95 0.95 94
```
---
layout: model
title: Legal Oil Industry Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_oil_industry_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, oil_industry, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the `legclf_oil_industry_bert` model, a Bert Sentence Embeddings Document Classifier, determines whether the document belongs to the class `Oil_Industry` or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`Oil_Industry`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_oil_industry_bert_en_1.0.0_3.0_1678111691969.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_oil_industry_bert_en_1.0.0_3.0_1678111691969.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
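This card is missing its usage snippet; the sketch below follows the pattern used by the other Legal NLP classifier cards in this collection. The `sent_bert_base_cased` embeddings model is an assumption — verify the exact embeddings this classifier was trained with before relying on it.

```python
# Sketch only: the embeddings model name below is assumed, not confirmed by this card.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_oil_industry_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```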
## Results
```bash
+-------+
|result|
+-------+
|[Oil_Industry]|
|[Other]|
|[Other]|
|[Oil_Industry]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_oil_industry_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.6 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Oil_Industry 0.85 0.93 0.89 43
Other 0.92 0.82 0.87 40
accuracy - - 0.88 83
macro-avg 0.88 0.88 0.88 83
weighted-avg 0.88 0.88 0.88 83
```
---
layout: model
title: Korean ElectraForQuestionAnswering model (from monologg) Version-2
author: John Snow Labs
name: electra_qa_base_v2_finetuned_korquad_384
date: 2022-06-22
tags: [ko, open_source, electra, question_answering]
task: Question Answering
language: ko
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `koelectra-base-v2-finetuned-korquad-384` is a Korean model originally trained by `monologg`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_v2_finetuned_korquad_384_ko_4.0.0_3.0_1655922143142.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_v2_finetuned_korquad_384_ko_4.0.0_3.0_1655922143142.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_v2_finetuned_korquad_384","ko") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_v2_finetuned_korquad_384","ko")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ko.answer_question.korquad.electra.base_v2_384.by_monologg").predict("""내 이름은 무엇입니까?|||제 이름은 클라라이고 저는 버클리에 살고 있습니다.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_base_v2_finetuned_korquad_384|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ko|
|Size:|412.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/monologg/koelectra-base-v2-finetuned-korquad-384
---
layout: model
title: Pipeline to Classify Texts into TREC-6 Classes
author: John Snow Labs
name: bert_sequence_classifier_trec_coarse_pipeline
date: 2022-06-19
tags: [bert_sequence, trec, coarse, bert, en, open_source]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [bert_sequence_classifier_trec_coarse_en](https://nlp.johnsnowlabs.com/2021/11/06/bert_sequence_classifier_trec_coarse_en.html).
The TREC dataset for question classification consists of open-domain, fact-based questions divided into broad semantic categories. You can check the official documentation of the dataset, entities, etc. [here](https://search.r-project.org/CRAN/refmans/textdata/html/dataset_trec.html).
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_trec_coarse_pipeline_en_4.0.0_3.0_1655653749614.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_trec_coarse_pipeline_en_4.0.0_3.0_1655653749614.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
trec_pipeline = PretrainedPipeline("bert_sequence_classifier_trec_coarse_pipeline", lang = "en")
trec_pipeline.annotate("Germany is the largest country in Europe economically.")
```
```scala
val trec_pipeline = new PretrainedPipeline("bert_sequence_classifier_trec_coarse_pipeline", lang = "en")
trec_pipeline.annotate("Germany is the largest country in Europe economically.")
```
## Results
```bash
['LOC']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_trec_coarse_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|406.6 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- BertForSequenceClassification
---
layout: model
title: Extract medical devices and clinical department mentions (Voice of the Patients)
author: John Snow Labs
name: ner_vop_clinical_dept_wip
date: 2023-05-19
tags: [licensed, clinical, en, ner, vop, patient]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts mentions of medical devices and clinical departments from documents written in the patient's own words.
Note: the 'wip' suffix indicates that model development is a work in progress; the model will be finalised and its performance improved in upcoming releases.
## Predicted Entities
`AdmissionDischarge`, `ClinicalDept`, `MedicalDevice`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_wip_en_4.4.2_3.0_1684512218256.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_wip_en_4.4.2_3.0_1684512218256.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_vop_clinical_dept_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. Wishing him a speedy recovery!"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_vop_clinical_dept_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. Wishing him a speedy recovery!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| chunk | ner_label |
|:----------------------|:--------------|
| orthopedic department | ClinicalDept |
| titanium plate | MedicalDevice |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_clinical_dept_wip|
|Compatibility:|Healthcare NLP 4.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.8 MB|
|Dependencies:|embeddings_clinical|
## References
In-house annotated health-related text in colloquial language.
## Sample text from the training dataset
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
## Benchmarking
```bash
label tp fp fn total precision recall f1
AdmissionDischarge 29 1 5 34 0.97 0.85 0.91
ClinicalDept 292 41 34 326 0.88 0.90 0.89
MedicalDevice 244 72 88 332 0.77 0.73 0.75
macro_avg 565 114 127 692 0.87 0.83 0.85
micro_avg 565 114 127 692 0.83 0.82 0.82
```
---
layout: model
title: Explain Document Pipeline for Dutch
author: John Snow Labs
name: explain_document_md
date: 2021-03-22
tags: [open_source, dutch, explain_document_md, pipeline, nl]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: nl
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The explain_document_md is a pretrained pipeline that processes text with a simple sequence of basic processing steps.
It performs most of the common text processing tasks on your DataFrame.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_nl_3.0.0_3.0_1616434945966.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_nl_3.0.0_3.0_1616434945966.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('explain_document_md', lang = 'nl')
annotations = pipeline.fullAnnotate("Hallo van John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_md", lang = "nl")
val result = pipeline.fullAnnotate("Hallo van John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hallo van John Snow Labs! "]
result_df = nlu.load('nl.explain.md').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | lemma | pos | embeddings | ner | entities |
|---:|:-------------------------------|:------------------------------|:------------------------------------------|:------------------------------------------|:--------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
| 0 | ['Hallo van John Snow Labs! '] | ['Hallo van John Snow Labs!'] | ['Hallo', 'van', 'John', 'Snow', 'Labs!'] | ['Hallo', 'van', 'John', 'Snow', 'Labs!'] | ['PROPN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.5910000205039978,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_document_md|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|nl|
---
layout: model
title: Fast Neural Machine Translation Model from Tigrinya to English
author: John Snow Labs
name: opus_mt_ti_en
date: 2020-12-29
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, ti, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial partners.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `ti`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ti_en_xx_2.7.0_2.4_1609254492007.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ti_en_xx_2.7.0_2.4_1609254492007.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_ti_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_ti_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.ti.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_ti_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForQuestionAnswering model (from xraychen)
author: John Snow Labs
name: bert_qa_mqa_baseline
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mqa-baseline` is an English model originally trained by `xraychen`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mqa_baseline_en_4.0.0_3.0_1654188319379.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mqa_baseline_en_4.0.0_3.0_1654188319379.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mqa_baseline","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_mqa_baseline","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bert.base.by_xraychen").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
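The nlu loader above joins the question and the context into a single string separated by `|||`. As a minimal illustration of that separator convention (plain Python, not part of the nlu API), the two parts can be recovered like this:

```python
def split_qa(payload: str, sep: str = "|||"):
    """Split a 'question|||context' payload into its two parts."""
    question, _, context = payload.partition(sep)
    return question.strip(), context.strip()

question, context = split_qa(
    "What's my name?|||My name is Clara and I live in Berkeley."
)
# question -> "What's my name?"
```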
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_mqa_baseline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/xraychen/mqa-baseline
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265898
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265898` is an English model originally trained by `teacookies`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265898_en_4.0.0_3.0_1655984453355.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265898_en_4.0.0_3.0_1655984453355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265898","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265898","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265898").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265898|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|888.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265898
---
layout: model
title: RoBERTa base biomedical
author: ireneisdoomed
name: roberta_base_biomedical
date: 2022-01-13
tags: [es, open_source]
task: Text Classification
language: es
edition: Spark NLP 3.4.0
spark_version: 3.0
supported: false
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model has been pulled from the Hugging Face Hub: https://huggingface.co/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es
It is the result of reproducing the tutorial on importing Hugging Face models into Spark NLP: https://medium.com/spark-nlp/importing-huggingface-models-into-sparknlp-8c63bdea671d
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ireneisdoomed/roberta_base_biomedical_es_3.4.0_3.0_1642093372752.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://community.johnsnowlabs.com/ireneisdoomed/roberta_base_biomedical_es_3.4.0_3.0_1642093372752.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_base_biomedical|
|Compatibility:|Spark NLP 3.4.0+|
|License:|Open Source|
|Edition:|Community|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|es|
|Size:|301.7 MB|
---
layout: model
title: Extract entities in clinical trial abstracts
author: John Snow Labs
name: ner_clinical_trials_abstracts
date: 2022-06-22
tags: [ner, clinical, en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Named Entity Recognition model uses a deep learning architecture (char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
It extracts relevant entities from clinical trial abstracts. It uses a simplified version of the ontology specified by [Sanchez Graillet, O., et al.](https://pub.uni-bielefeld.de/record/2939477) in order to extract concepts related to trial design, diseases, drugs, population, statistics and publication.
## Predicted Entities
`Age`, `AllocationRatio`, `Author`, `BioAndMedicalUnit`, `CTAnalysisApproach`, `CTDesign`, `Confidence`, `Country`, `DisorderOrSyndrome`, `DoseValue`, `Drug`, `DrugTime`, `Duration`, `Journal`, `NumberPatients`, `PMID`, `PValue`, `PercentagePatients`, `PublicationYear`, `TimePoint`, `Value`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_trials_abstracts_en_3.5.3_3.0_1655911616789.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_trials_abstracts_en_3.5.3_3.0_1655911616789.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical" ,"en", "clinical/models")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical_trials_abstracts", "en", "clinical/models")\
.setInputCols(["sentence","token", "embeddings"])\
.setOutputCol("ner")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner])
text = ["A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes. In a multicentre, open, randomised study, 570 patients with Type 2 diabetes, aged 34 - 80 years, were treated for 52 weeks with insulin glargine or NPH insulin given once daily at bedtime."]
data = spark.createDataFrame([text]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical_trials_abstracts", "en", "clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner))
val text = "A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes. In a multicentre, open, randomised study, 570 patients with Type 2 diabetes, aged 34 - 80 years, were treated for 52 weeks with insulin glargine or NPH insulin given once daily at bedtime."
val data = Seq(text).toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.ner.clinical_trials_abstracts").predict("""A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes. In a multicentre, open, randomised study, 570 patients with Type 2 diabetes, aged 34 - 80 years, were treated for 52 weeks with insulin glargine or NPH insulin given once daily at bedtime.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hyan97_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hyan97_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_hyan97_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/hyan97/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Pipeline to Detect Drug Information
author: John Snow Labs
name: ner_posology_large_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, drug, en]
task: [Named Entity Recognition, Pipeline Healthcare]
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_posology_large](https://nlp.johnsnowlabs.com/2021/03/31/ner_posology_large_en.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange.button-orange-trans.arr.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_POSOLOGY.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_large_pipeline_en_3.4.1_3.0_1647873112906.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_large_pipeline_en_3.4.1_3.0_1647873112906.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_posology_large_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_posology_large_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.posoloy_large.pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""")
```
## Results
```bash
+--------------+---------+
|chunk |ner |
+--------------+---------+
|insulin |DRUG |
|Bactrim |DRUG |
|for 14 days |DURATION |
|Fragmin |DRUG |
|5000 units |DOSAGE |
|subcutaneously|ROUTE |
|daily |FREQUENCY|
|Xenaderm |DRUG |
|topically |ROUTE |
|b.i.d |FREQUENCY|
|Lantus |DRUG |
|40 units |DOSAGE |
|subcutaneously|ROUTE |
|at bedtime |FREQUENCY|
|OxyContin |DRUG |
|30 mg |STRENGTH |
|p.o |ROUTE |
|q.12 h |FREQUENCY|
|folic acid |DRUG |
|1 mg |STRENGTH |
+--------------+---------+
only showing top 20 rows
```
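The chunk/label pairs in the table above come from merging the model's token-level BIO tags into entity spans (the job of the pipeline's NerConverter stage). As an illustration only, and not the actual NerConverter implementation, a simplified sketch of that merging step:

```python
def bio_to_chunks(tokens, tags):
    """Merge token-level BIO tags into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" tag or inconsistent I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Fragmin", "5000", "units", "subcutaneously", "daily"]
tags = ["B-DRUG", "B-DOSAGE", "I-DOSAGE", "B-ROUTE", "B-FREQUENCY"]
# -> [('Fragmin', 'DRUG'), ('5000 units', 'DOSAGE'),
#     ('subcutaneously', 'ROUTE'), ('daily', 'FREQUENCY')]
print(bio_to_chunks(tokens, tags))
```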
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_posology_large_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: Explain Document Pipeline for French
author: John Snow Labs
name: explain_document_md
date: 2021-03-22
tags: [open_source, french, explain_document_md, pipeline, fr]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: fr
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The explain_document_md is a pretrained pipeline that processes text with a simple sequence of basic NLP steps.
It performs most of the common text processing tasks on your DataFrame.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_fr_3.0.0_3.0_1616429735046.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_fr_3.0.0_3.0_1616429735046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('explain_document_md', lang = 'fr')
annotations = pipeline.fullAnnotate("Bonjour de John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_md", lang = "fr")
val result = pipeline.fullAnnotate("Bonjour de John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = [""Bonjour de John Snow Labs! ""]
result_df = nlu.load('fr.explain.md').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | lemma | pos | embeddings | ner | entities |
|---:|:--------------------------------|:-------------------------------|:-------------------------------------------|:-------------------------------------------|:-------------------------------------------|:-----------------------------|:-------------------------------------------|:-------------------------------|
| 0 | ['Bonjour de John Snow Labs! '] | ['Bonjour de John Snow Labs!'] | ['Bonjour', 'de', 'John', 'Snow', 'Labs!'] | ['Bonjour', 'de', 'John', 'Snow', 'Labs!'] | ['INTJ', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0783179998397827,.,...]] | ['I-MISC', 'O', 'I-PER', 'I-PER', 'I-PER'] | ['Bonjour', 'John Snow Labs!'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_document_md|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|fr|
---
layout: model
title: Word2Vec Embeddings in Yiddish (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, yi, open_source]
task: Embeddings
language: yi
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_yi_3.4.1_3.0_1647467653837.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_yi_3.4.1_3.0_1647467653837.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","yi") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["איך ליבע אָנצינדן נלפּ"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","yi")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("איך ליבע אָנצינדן נלפּ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("yi.embed.w2v_cc_300d").predict("""איך ליבע אָנצינדן נלפּ""")
```
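The pipelines above attach a 300-dimensional vector to each token in the `embeddings` column. Once extracted, such word2vec vectors are typically compared with cosine similarity; a minimal plain-Python sketch (toy 3-d vectors stand in for the model's 300-d output):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

print(cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```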
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|yi|
|Size:|114.5 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Japanese Lemmatizer
author: John Snow Labs
name: lemma
date: 2021-01-15
task: Lemmatization
language: ja
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [ja, lemmatizer, open_source]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/TEXT_PREPROCESSING/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_PREPROCESSING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_ja_2.7.0_2.4_1610746691356.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_ja_2.7.0_2.4_1610746691356.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
word_segmenter = WordSegmenterModel.pretrained('wordseg_gsd_ud', 'ja')\
.setInputCols("document")\
.setOutputCol("token")
lemmatizer = LemmatizerModel.pretrained("lemma", "ja") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
nlp_pipeline = Pipeline(stages=[document_assembler, word_segmenter , lemmatizer])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
results = light_pipeline.fullAnnotate(["これに不快感を示す住民はいましたが,現在,表立って反対や抗議の声を挙げている住民はいないようです。"])
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja")
.setInputCols("document")
.setOutputCol("token")
val lemmatizer = LemmatizerModel.pretrained("lemma", "ja")
.setInputCols("token")
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter , lemmatizer))
val data = Seq("これに不快感を示す住民はいましたが,現在,表立って反対や抗議の声を挙げている住民はいないようです。").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["これに不快感を示す住民はいましたが,現在,表立って反対や抗議の声を挙げている住民はいないようです。"]
lemma_df = nlu.load('ja.lemma').predict(text, output_level = "document")
lemma_df.lemma.values[0]
```
## Results
```bash
{'lemma': [Annotation(token, 0, 1, これ, {'sentence': '0'}),
Annotation(token, 2, 2, にる, {'sentence': '0'}),
Annotation(token, 3, 4, 不快, {'sentence': '0'}),
Annotation(token, 5, 5, 感, {'sentence': '0'}),
Annotation(token, 6, 6, を, {'sentence': '0'}),
Annotation(token, 7, 8, 示す, {'sentence': '0'}),
Annotation(token, 9, 10, 住民, {'sentence': '0'}),
Annotation(token, 11, 11, はる, {'sentence': '0'}),
Annotation(token, 12, 12, いる, {'sentence': '0'}),
Annotation(token, 13, 14, まする, {'sentence': '0'}),
Annotation(token, 15, 15, たる, {'sentence': '0'}),
Annotation(token, 16, 16, がる, {'sentence': '0'}),
Annotation(token, 17, 17, ,, {'sentence': '0'}),
Annotation(token, 18, 19, 現在, {'sentence': '0'}),
Annotation(token, 20, 20, ,, {'sentence': '0'}),
Annotation(token, 21, 23, 表立つ, {'sentence': '0'}),
Annotation(token, 24, 24, てる, {'sentence': '0'}),
Annotation(token, 25, 26, 反対, {'sentence': '0'}),
Annotation(token, 27, 27, やる, {'sentence': '0'}),
Annotation(token, 28, 29, 抗議, {'sentence': '0'}),
Annotation(token, 30, 30, のる, {'sentence': '0'}),
Annotation(token, 31, 31, 声, {'sentence': '0'}),
Annotation(token, 32, 32, を, {'sentence': '0'}),
Annotation(token, 33, 34, 挙げる, {'sentence': '0'}),
Annotation(token, 35, 35, てる, {'sentence': '0'}),
Annotation(token, 36, 37, いる, {'sentence': '0'}),
Annotation(token, 38, 39, 住民, {'sentence': '0'}),
Annotation(token, 40, 40, はる, {'sentence': '0'}),
Annotation(token, 41, 41, いる, {'sentence': '0'}),
Annotation(token, 42, 43, なぐ, {'sentence': '0'}),
Annotation(token, 44, 45, よう, {'sentence': '0'}),
Annotation(token, 46, 47, です, {'sentence': '0'}),
Annotation(token, 48, 48, 。, {'sentence': '0'})]}
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|ja|
## Data Source
The model was trained using the Universal Dependencies version 2 data set and the _IPADIC_ dictionary from [MeCab](https://taku910.github.io/mecab/).
References:
> - Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC-2018.
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_kv16
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-kv16` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_kv16_en_4.3.0_3.0_1675121314422.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_kv16_en_4.3.0_3.0_1675121314422.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_small_kv16","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_small_kv16","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_kv16|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|120.7 MB|
## References
- https://huggingface.co/google/t5-efficient-small-kv16
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Lewotobi RobertaForQuestionAnswering (from 21iridescent)
author: John Snow Labs
name: roberta_qa_RoBERTa_base_finetuned_squad2_lwt
date: 2022-06-20
tags: [open_source, question_answering, roberta]
task: Question Answering
language: lwt
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RoBERTa-base-finetuned-squad2-lwt` is a Lewotobi model originally trained by `21iridescent`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_RoBERTa_base_finetuned_squad2_lwt_lwt_4.0.0_3.0_1655727062223.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_RoBERTa_base_finetuned_squad2_lwt_lwt_4.0.0_3.0_1655727062223.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_RoBERTa_base_finetuned_squad2_lwt","lwt") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_RoBERTa_base_finetuned_squad2_lwt","lwt")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("lwt.answer_question.squadv2.roberta.base.by_21iridescent").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_RoBERTa_base_finetuned_squad2_lwt|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|lwt|
|Size:|464.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/21iridescent/RoBERTa-base-finetuned-squad2-lwt
---
layout: model
title: Professions & Occupations NER model in Spanish (meddroprof_scielowiki)
author: John Snow Labs
name: meddroprof_scielowiki
date: 2022-12-18
tags: [ner, licensed, professions, es, occupations]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
NER model that detects professions and occupations in Spanish texts. It was trained with the `embeddings_scielowiki_300d` embeddings, so the same `WordEmbeddingsModel` must be included in the pipeline.
## Predicted Entities
`ACTIVIDAD`, `PROFESION`, `SITUACION_LABORAL`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_PROFESSIONS_ES/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_PROFESSIONS_ES.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/meddroprof_scielowiki_es_4.2.2_3.0_1671367707210.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/meddroprof_scielowiki_es_4.2.2_3.0_1671367707210.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols("document") \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d", "es", "clinical/models")\
.setInputCols(["document", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("meddroprof_scielowiki", "es", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter])
sample_text = """La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO"""
df = spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(df).transform(df)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d", "es", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("meddroprof_scielowiki", "es", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter))
val data = Seq("""La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.med_ner.scielowiki").predict("""La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO""")
```
## Results
```bash
+---------------------------------------+-----------------+
|chunk |ner_label |
+---------------------------------------+-----------------+
|estudiando 1o ESO |SITUACION_LABORAL|
|ATS |PROFESION |
|trabajan en diferentes centros de salud|PROFESION |
|estudiando 1o ESO |SITUACION_LABORAL|
+---------------------------------------+-----------------+
```
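The `NerConverter` stage produces the chunks above by merging consecutive `B-`/`I-` tags emitted by the NER model into single spans. A simplified sketch of that merging logic (not the annotator's actual implementation, and without character offsets):

```python
def merge_bio(tokens, tags):
    """Merge BIO-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["estudiando", "1o", "ESO", ".", "Sus", "padres", "son", "ATS"]
tags = ["B-SITUACION_LABORAL", "I-SITUACION_LABORAL", "I-SITUACION_LABORAL",
        "O", "O", "O", "O", "B-PROFESION"]
print(merge_bio(tokens, tags))
# [('estudiando 1o ESO', 'SITUACION_LABORAL'), ('ATS', 'PROFESION')]
```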
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|meddroprof_scielowiki|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|14.8 MB|
## References
The model was trained with the [MEDDOPROF](https://temu.bsc.es/meddoprof/data/) data set:
> The MEDDOPROF corpus is a collection of 1844 clinical cases from over 20 different specialties annotated with professions and employment statuses. The corpus was annotated by a team composed of linguists and clinical experts following specially prepared annotation guidelines, after several cycles of quality control and annotation consistency analysis before annotating the entire dataset.
Reference:
```
@article{meddoprof,
title={NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts},
author={Lima-López, Salvador and Farré-Maduell, Eulàlia and Miranda-Escalada, Antonio and Brivá-Iglesias, Vicent and Krallinger, Martin},
journal = {Procesamiento del Lenguaje Natural},
volume = {67},
year={2021}
}
```
## Benchmarking
```bash
label precision recall f1-score support
B-ACTIVIDAD 0.82 0.36 0.50 25
B-PROFESION 0.87 0.75 0.81 634
B-SITUACION_LABORAL 0.79 0.67 0.72 310
I-ACTIVIDAD 0.86 0.43 0.57 58
I-PROFESION 0.87 0.80 0.83 944
I-SITUACION_LABORAL 0.74 0.71 0.73 407
O 1.00 1.00 1.00 139880
accuracy - - 0.99 142258
macro-avg 0.85 0.67 0.74 142258
weighted-avg 0.99 0.99 0.99 142258
```
---
layout: model
title: English RobertaForQuestionAnswering (from veronica320)
author: John Snow Labs
name: roberta_qa_QA_for_Event_Extraction
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `QA-for-Event-Extraction` is an English model originally trained by `veronica320`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_QA_for_Event_Extraction_en_4.0.0_3.0_1655726863853.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_QA_for_Event_Extraction_en_4.0.0_3.0_1655726863853.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_QA_for_Event_Extraction","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_QA_for_Event_Extraction","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.roberta.by_veronica320").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_QA_for_Event_Extraction|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/veronica320/QA-for-Event-Extraction
- https://aclanthology.org/2021.acl-short.42/
- https://github.com/veronica320/Zeroshot-Event-Extraction
- https://github.com/uwnlp/qamr
---
layout: model
title: Slovak BertForMaskedLM Cased model (from fav-kky)
author: John Snow Labs
name: bert_embeddings_fernet_cc
date: 2022-12-02
tags: [sk, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: sk
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `FERNET-CC_sk` is a Slovak model originally trained by `fav-kky`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_fernet_cc_sk_4.2.4_3.0_1670015248677.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_fernet_cc_sk_4.2.4_3.0_1670015248677.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_fernet_cc","sk") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_fernet_cc","sk")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_fernet_cc|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|sk|
|Size:|612.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/fav-kky/FERNET-CC_sk
- https://arxiv.org/abs/2107.10042
---
layout: model
title: Legal NER for NDA (Confidential Information-Restricted)
author: John Snow Labs
name: legner_nda_confidential_information_restricted
date: 2023-04-11
tags: [en, legal, licensed, ner, nda]
task: Named Entity Recognition
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is an NER model, aimed to be run **only** after detecting the `USE_OF_CONF_INFO` clause with a proper classifier (use `legmulticlf_mnda_sections_paragraph_other` for that purpose). It will extract the following entities: `RESTRICTED_ACTION`, `RESTRICTED_SUBJECT`, `RESTRICTED_OBJECT`, and `RESTRICTED_IND_OBJECT`.
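The intended two-stage flow is: classify each paragraph first, then run this NER model only on paragraphs labeled `USE_OF_CONF_INFO`. A schematic sketch of that gating, where `classify_clause` and `extract_entities` are hypothetical stand-ins for the classifier and NER pipelines:

```python
def classify_clause(paragraph):
    # Hypothetical stand-in for the legmulticlf_mnda_sections_paragraph_other classifier.
    return "USE_OF_CONF_INFO" if "disclose" in paragraph.lower() else "OTHER"

def extract_entities(paragraph):
    # Hypothetical stand-in for this NER model's transform step.
    return [("recipient", "RESTRICTED_SUBJECT")]

def process(paragraphs):
    """Run NER only on paragraphs gated in by the clause classifier."""
    results = []
    for p in paragraphs:
        if classify_clause(p) == "USE_OF_CONF_INFO":
            results.append((p, extract_entities(p)))
    return results

docs = ["The recipient may not disclose such information.",
        "This Agreement is governed by Delaware law."]
print(process(docs))
```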
## Predicted Entities
`RESTRICTED_ACTION`, `RESTRICTED_SUBJECT`, `RESTRICTED_OBJECT`, `RESTRICTED_IND_OBJECT`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_confidential_information_restricted_en_1.0.0_3.0_1681210372591.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_confidential_information_restricted_en_1.0.0_3.0_1681210372591.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)\
.setCaseSensitive(True)
ner_model = legal.NerModel.pretrained("legner_nda_confidential_information_restricted", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""The recipient may use the proprietary information solely for the purpose of performing its obligations under a separate agreement with the disclosing party, and may not disclose such information to any third party without the prior written consent of the disclosing party."""]
result = model.transform(spark.createDataFrame([text]).toDF("text"))
```
## Results
```bash
+-----------+---------------------+
|chunk |ner_label |
+-----------+---------------------+
|recipient |RESTRICTED_SUBJECT |
|disclose |RESTRICTED_ACTION |
|information|RESTRICTED_OBJECT |
|third party|RESTRICTED_IND_OBJECT|
+-----------+---------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_nda_confidential_information_restricted|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|16.3 MB|
## References
In-house annotations on the Non-disclosure Agreements
## Benchmarking
```bash
label precision recall f1-score support
RESTRICTED_ACTION 0.92 0.94 0.93 36
RESTRICTED_IND_OBJECT 1.00 0.93 0.97 15
RESTRICTED_OBJECT 0.74 1.00 0.85 26
RESTRICTED_SUBJECT 0.72 0.90 0.80 29
micro-avg 0.82 0.94 0.88 106
macro-avg 0.85 0.94 0.89 106
weighted-avg 0.83 0.94 0.88 106
```
---
layout: model
title: Financial NER (Signers)
author: John Snow Labs
name: finner_signers
date: 2023-02-24
tags: [signers, parties, en, licensed]
task: Named Entity Recognition
language: en
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: FinanceNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a Legal NER model, aimed at processing the last page of agreements, where information can be found about:
- People signing the document;
- The title those people hold in their companies;
- The company (party) they represent.
## Predicted Entities
`SIGNING_TITLE`, `SIGNING_PERSON`, `PARTY`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/LEGALNER_SIGNERS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_signers_en_1.0.0_3.0_1677258652137.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_signers_en_1.0.0_3.0_1677258652137.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained('finner_signers', 'en', 'finance/models')\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = """
VENDOR:
VENDINGDATA CORPORATION, a Nevada corporation
By: /s/ Steven J. Blad
Its: Steven J. Blad CEO
DISTRIBUTOR:
TECHNICAL CASINO SUPPLIES LTD, an English company
By: /s/ David K. Heap
Its: David K. Heap Chief Executive Officer
-15-"""
res = model.transform(spark.createDataFrame([[text]]).toDF("text"))
```
---
layout: model
title: Pipeline to Extract Cancer Therapies and Posology Information
author: John Snow Labs
name: ner_oncology_unspecific_posology_healthcare_pipeline
date: 2023-03-08
tags: [licensed, clinical, oncology, en, ner, treatment, posology]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_oncology_unspecific_posology_healthcare](https://nlp.johnsnowlabs.com/2023/01/11/ner_oncology_unspecific_posology_healthcare_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_healthcare_pipeline_en_4.3.0_3.2_1678269380685.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_healthcare_pipeline_en_4.3.0_3.2_1678269380685.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_oncology_unspecific_posology_healthcare_pipeline", "en", "clinical/models")
text = """The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition."""
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_oncology_unspecific_posology_healthcare_pipeline", "en", "clinical/models")
val text = """The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition."""
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | chunks | begin | end | entities | confidence |
|---:|:-----------------|--------:|------:|:---------------------|-------------:|
| 0 | adriamycin | 46 | 55 | Cancer_Therapy | 0.9999 |
| 1 | 60 mg/m2 | 58 | 65 | Posology_Information | 0.807 |
| 2 | cyclophosphamide | 72 | 87 | Cancer_Therapy | 0.9998 |
| 3 | 600 mg/m2 | 90 | 98 | Posology_Information | 0.9566 |
| 4 | over six courses | 101 | 116 | Posology_Information | 0.689833 |
| 5 | second cycle | 150 | 161 | Posology_Information | 0.9906 |
| 6 | chemotherapy | 166 | 177 | Cancer_Therapy | 0.9997 |
```
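The `confidence` column is plausibly an aggregation of the per-token NER confidences within each chunk (e.g. `over six courses` at ~0.69 would combine three token scores). A sketch of that idea, assuming simple mean aggregation with hypothetical token scores:

```python
def chunk_confidence(token_scores):
    """Average per-token confidences into a single chunk score."""
    return sum(token_scores) / len(token_scores)

# Hypothetical token scores for the three tokens of "over six courses".
print(round(chunk_confidence([0.62, 0.71, 0.74]), 4))  # 0.69
```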
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_unspecific_posology_healthcare_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|533.1 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Korean RoBERTa Embeddings (from lassl)
author: John Snow Labs
name: roberta_embeddings_roberta_ko_small
date: 2022-04-14
tags: [roberta, embeddings, ko, open_source]
task: Embeddings
language: ko
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-ko-small` is a Korean model originally trained by `lassl`.
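Once the `embeddings` column is populated, token vectors can be compared downstream, for example with cosine similarity. A dependency-free sketch (toy 3-dimensional vectors for illustration, not the model's actual output dimension):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 4))  # 1.0
```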
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_ko_small_ko_3.4.2_3.0_1649947873838.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_ko_small_ko_3.4.2_3.0_1649947873838.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_ko_small","ko") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_ko_small","ko")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ko.embed.roberta_ko_small").predict("""나는 Spark NLP를 좋아합니다""")
```
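The `embeddings` output column carries one dense vector per token. A typical downstream use is comparing tokens or pooled sentences by cosine similarity; a minimal sketch with toy vectors (illustrative values only, not actual model output):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 4-dimensional "token embeddings" (real vectors are model-sized).
v_spark = [0.2, 0.8, 0.1, 0.4]
v_nlp   = [0.25, 0.75, 0.05, 0.5]
v_other = [-0.9, 0.1, 0.8, -0.3]

print(cosine(v_spark, v_nlp))    # near 1.0: similar directions
print(cosine(v_spark, v_other))  # much lower
```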
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_roberta_ko_small|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ko|
|Size:|87.3 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/lassl/roberta-ko-small
- https://github.com/lassl/lassl
---
layout: model
title: Named Entity Recognition - ELECTRA Large (OntoNotes)
author: John Snow Labs
name: onto_electra_large_uncased
date: 2020-12-05
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [ner, en, open_source]
supported: true
annotator: NerDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Onto is a Named Entity Recognition (or NER) model trained on OntoNotes 5.0. It can extract 18 entity types, such as people, places, organizations, money, time, and dates.
This model uses the pretrained `electra_large_uncased` embeddings model from the `BertEmbeddings` annotator as an input.
## Predicted Entities
`CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_electra_large_uncased_en_2.7.0_2.4_1607198670231.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_electra_large_uncased_en_2.7.0_2.4_1607198670231.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("electra_large_uncased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
ner_onto = NerDLModel.pretrained("onto_electra_large_uncased", "en") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text'))
result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("electra_large_uncased", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_onto = NerDLModel.pretrained("onto_electra_large_uncased", "en")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter))
val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""]
ner_df = nlu.load('en.ner.onto.electra.uncased_large').predict(text, output_level='chunk')
ner_df[["entities", "entities_class"]]
```
{:.h2_title}
## Results
```bash
+---------------------+---------+
|chunk |ner_label|
+---------------------+---------+
|William Henry Gates |PERSON |
|October 28, 1955 |DATE |
|American |NORP |
|Microsoft Corporation|ORG |
|Microsoft |ORG |
|Gates |PERSON |
|CEO |ORG |
|May 2014 |DATE |
|one |CARDINAL |
|the 1970s and 1980s |DATE |
|Seattle |GPE |
|Washington |GPE |
|Gates |PERSON |
|Microsoft |FAC |
|Paul Allen |PERSON |
|1975 |DATE |
|Albuquerque |GPE |
|New Mexico |GPE |
|Gates |PERSON |
|January 2000 |DATE |
+---------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|onto_electra_large_uncased|
|Type:|ner|
|Compatibility:|Spark NLP 2.7.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Data Source
The model is trained based on data from OntoNotes 5.0 [https://catalog.ldc.upenn.edu/LDC2013T19](https://catalog.ldc.upenn.edu/LDC2013T19)
## Benchmarking
```bash
Micro-average:
prec: 0.88980144, rec: 0.88069624, f1: 0.8852254
CoNLL Eval:
processed 152728 tokens with 11257 phrases; found: 11227 phrases; correct: 9876.
accuracy: 97.64%; 9876 11257 11227 precision: 87.97%; recall: 87.73%; FB1: 87.85
CARDINAL: 789 935 937 precision: 84.20%; recall: 84.39%; FB1: 84.29 937
DATE: 1399 1602 1640 precision: 85.30%; recall: 87.33%; FB1: 86.30 1640
EVENT: 30 63 43 precision: 69.77%; recall: 47.62%; FB1: 56.60 43
FAC: 72 135 115 precision: 62.61%; recall: 53.33%; FB1: 57.60 115
GPE: 2131 2240 2252 precision: 94.63%; recall: 95.13%; FB1: 94.88 2252
LANGUAGE: 8 22 9 precision: 88.89%; recall: 36.36%; FB1: 51.61 9
LAW: 20 40 31 precision: 64.52%; recall: 50.00%; FB1: 56.34 31
LOC: 123 179 202 precision: 60.89%; recall: 68.72%; FB1: 64.57 202
MONEY: 286 314 321 precision: 89.10%; recall: 91.08%; FB1: 90.08 321
NORP: 803 841 918 precision: 87.47%; recall: 95.48%; FB1: 91.30 918
ORDINAL: 177 195 218 precision: 81.19%; recall: 90.77%; FB1: 85.71 218
ORG: 1502 1795 1687 precision: 89.03%; recall: 83.68%; FB1: 86.27 1687
PERCENT: 306 349 344 precision: 88.95%; recall: 87.68%; FB1: 88.31 344
PERSON: 1887 1988 2020 precision: 93.42%; recall: 94.92%; FB1: 94.16 2020
PRODUCT: 48 76 62 precision: 77.42%; recall: 63.16%; FB1: 69.57 62
QUANTITY: 85 105 111 precision: 76.58%; recall: 80.95%; FB1: 78.70 111
TIME: 128 212 190 precision: 67.37%; recall: 60.38%; FB1: 63.68 190
WORK_OF_ART: 82 166 127 precision: 64.57%; recall: 49.40%; FB1: 55.97 127
```
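Each per-label row in the CoNLL eval above lists three counts (correct, gold, predicted) followed by the metrics derived from them. The derivation is straightforward, sketched here using the overall counts from the eval:

```python
def prf(correct, gold, predicted):
    """Precision, recall, and F1 from CoNLL-style chunk counts."""
    p = correct / predicted
    r = correct / gold
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# Overall counts from the eval above: 9876 correct, 11257 gold, 11227 predicted.
p, r, f1 = prf(9876, 11257, 11227)
print(f"precision: {p:.2%}  recall: {r:.2%}  FB1: {f1:.2%}")
```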
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8 TFWav2Vec2ForCTC from lilitket
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8` is an English model originally trained by lilitket.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8_en_4.2.0_3.0_1664121225531.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8_en_4.2.0_3.0_1664121225531.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8", lang = "en")
val annotations = pipeline.transform(audioDF)
```
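Under the hood, the Wav2Vec2ForCTC stage of this pipeline produces one character distribution per audio frame, which CTC decoding collapses into text by merging repeats and dropping blanks. A toy greedy decoder showing the idea (hypothetical frame symbols, not the annotator's internals):

```python
BLANK = "_"  # CTC blank symbol (hypothetical; the real vocabulary is model-specific)

def ctc_greedy_decode(frames):
    """Collapse repeated symbols, then drop blanks."""
    out = []
    prev = None
    for sym in frames:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

# One most-likely symbol per audio frame, e.g. for the word "hello".
frames = ["h", "h", "_", "e", "e", "l", "_", "l", "l", "o", "_"]
print(ctc_greedy_decode(frames))  # hello
```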
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Norwegian BertForMaskedLM Cased model (from ltgoslo)
author: John Snow Labs
name: bert_embeddings_norbert
date: 2022-12-06
tags: ["no", open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: "no"
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `norbert` is a Norwegian model originally trained by `ltgoslo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_norbert_no_4.2.4_3.0_1670326996300.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_norbert_no_4.2.4_3.0_1670326996300.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_norbert","no") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_norbert","no")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_norbert|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|no|
|Size:|417.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/ltgoslo/norbert
- http://vectors.nlpl.eu/repository/20/216.zip
- http://norlm.nlpl.eu/
- https://github.com/ltgoslo/NorBERT
- https://arxiv.org/abs/2104.06546
- https://www.eosc-nordic.eu/
- https://www.mn.uio.no/ifi/english/research/projects/sant/index.html
- https://www.mn.uio.no/ifi/english/research/groups/ltg/
---
layout: model
title: Detect Oncology-Specific Entities
author: John Snow Labs
name: ner_oncology_wip
date: 2022-09-30
tags: [licensed, clinical, oncology, en, ner, biomarker, treatment]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts more than 40 oncology-related entities, including therapies, tests and staging.
Definitions of Predicted Entities:
- `Adenopathy`: Mentions of pathological findings of the lymph nodes.
- `Age`: All mentions of ages, past or present, related to the patient or to anybody else.
- `Biomarker`: Biological molecules that indicate the presence or absence of cancer, or the type of cancer. Oncogenes are excluded from this category.
- `Biomarker_Result`: Terms or values that are identified as the result of a biomarker.
- `Cancer_Dx`: Mentions of cancer diagnoses (such as "breast cancer") or pathological types that are usually used as synonyms for "cancer" (e.g. "carcinoma"). When anatomical references are present, they are included in the Cancer_Dx extraction.
- `Cancer_Score`: Clinical or imaging scores that are specific for cancer settings (e.g. "BI-RADS" or "Allred score").
- `Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment.
- `Chemotherapy`: Mentions of chemotherapy drugs, or unspecific words such as "chemotherapy".
- `Cycle_Count`: The total number of cycles of an oncological therapy being administered (e.g. "5 cycles").
- `Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. "day 5").
- `Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. "third cycle").
- `Date`: Mentions of exact dates, in any format, including day number, month and/or year.
- `Death_Entity`: Words that indicate the death of the patient or someone else (including family members), such as "died" or "passed away".
- `Direction`: Directional and laterality terms, such as "left", "right", "bilateral", "upper" and "lower".
- `Dosage`: The quantity prescribed by the physician for an active ingredient.
- `Duration`: Words indicating the duration of a treatment (e.g. "for 2 weeks").
- `Frequency`: Words indicating the frequency of treatment administration (e.g. "daily" or "bid").
- `Gender`: Gender-specific nouns and pronouns (including words such as "him" or "she", and family members such as "father").
- `Grade`: All pathological grading of tumors (e.g. "grade 1") or degrees of cellular differentiation (e.g. "well-differentiated").
- `Histological_Type`: Histological variants or cancer subtypes, such as "papillary", "clear cell" or "medullary".
- `Hormonal_Therapy`: Mentions of hormonal drugs used to treat cancer, or unspecific words such as "hormonal therapy".
- `Imaging_Test`: Imaging tests mentioned in texts, such as "chest CT scan".
- `Immunotherapy`: Mentions of immunotherapy drugs, or unspecific words such as "immunotherapy".
- `Invasion`: Mentions that refer to tumor invasion, such as "invasion" or "involvement". Metastases or lymph node involvement are excluded from this category.
- `Line_Of_Therapy`: Explicit references to the line of therapy of an oncological therapy (e.g. "first-line treatment").
- `Metastasis`: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions.
- `Oncogene`: Mentions of genes that are implicated in the etiology of cancer.
- `Pathology_Result`: The findings of a biopsy from the pathology report that is not covered by another entity (e.g. "malignant ductal cells").
- `Pathology_Test`: Mentions of biopsies or tests that use tissue samples.
- `Performance_Status`: Mentions of performance status scores, such as ECOG and Karnofsky. The name of the score is extracted together with the result (e.g. "ECOG performance status of 4").
- `Race_Ethnicity`: The race and ethnicity categories include racial and national origin or sociocultural groups.
- `Radiotherapy`: Terms that indicate the use of Radiotherapy.
- `Response_To_Treatment`: Terms related to clinical progress of the patient related to cancer treatment, including "recurrence", "bad response" or "improvement".
- `Relative_Date`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "yesterday" or "three years later").
- `Route`: Words indicating the type of administration route (such as "PO" or "transdermal").
- `Site_Bone`: Anatomical terms that refer to the human skeleton.
- `Site_Brain`: Anatomical terms that refer to the central nervous system (including the brain stem and the cerebellum).
- `Site_Breast`: Anatomical terms that refer to the breasts.
- `Site_Liver`: Anatomical terms that refer to the liver.
- `Site_Lung`: Anatomical terms that refer to the lungs.
- `Site_Lymph_Node`: Anatomical terms that refer to lymph nodes, excluding adenopathies.
- `Site_Other_Body_Part`: Relevant anatomical terms that are not included in the rest of the anatomical entities.
- `Smoking_Status`: All mentions of smoking related to the patient or to someone else.
- `Staging`: Mentions of cancer stage such as "stage 2b" or "T2N1M0". It also includes words such as "in situ", "early-stage" or "advanced".
- `Targeted_Therapy`: Mentions of targeted therapy drugs, or unspecific words such as "targeted therapy".
- `Tumor_Finding`: All nonspecific terms that may be related to tumors, either malignant or benign (for example: "mass", "tumor", "lesion", or "neoplasm").
- `Tumor_Size`: Size of the tumor, including numerical value and unit of measurement (e.g. "3 cm").
- `Unspecific_Therapy`: Terms that indicate a known cancer therapy but that is not specific to any other therapy entity (e.g. "chemoradiotherapy" or "adjuvant therapy").
## Predicted Entities
`Histological_Type`, `Direction`, `Staging`, `Cancer_Score`, `Imaging_Test`, `Cycle_Number`, `Tumor_Finding`, `Site_Lymph_Node`, `Invasion`, `Response_To_Treatment`, `Smoking_Status`, `Tumor_Size`, `Cycle_Count`, `Adenopathy`, `Age`, `Biomarker_Result`, `Unspecific_Therapy`, `Site_Breast`, `Chemotherapy`, `Targeted_Therapy`, `Radiotherapy`, `Performance_Status`, `Pathology_Test`, `Site_Other_Body_Part`, `Cancer_Surgery`, `Line_Of_Therapy`, `Pathology_Result`, `Hormonal_Therapy`, `Site_Bone`, `Biomarker`, `Immunotherapy`, `Cycle_Day`, `Frequency`, `Route`, `Duration`, `Death_Entity`, `Metastasis`, `Site_Liver`, `Cancer_Dx`, `Grade`, `Date`, `Site_Lung`, `Site_Brain`, `Relative_Date`, `Race_Ethnicity`, `Gender`, `Oncogene`, `Dosage`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_wip_en_4.0.0_3.0_1664556885893.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_wip_en_4.0.0_3.0_1664556885893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
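The `NerConverter` stage above merges the token-level BIO tags emitted by the NER model into entity chunks. A minimal sketch of that merging logic (hypothetical tokens and tags, not actual model output):

```python
def bio_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" tag or an I- tag that does not continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["left", "breast", "cancer", "treated", "with", "radiotherapy"]
tags   = ["B-Direction", "B-Cancer_Dx", "I-Cancer_Dx", "O", "O", "B-Radiotherapy"]
print(bio_to_chunks(tokens, tags))
```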
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_wip").predict("""The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_german_qg_quad","de") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_german_qg_quad","de")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_german_qg_quad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|de|
|Size:|923.3 MB|
## References
- https://huggingface.co/dehio/german-qg-t5-quad
- https://www.deepset.ai/germanquad
- https://github.com/d-e-h-i-o/german-qg
---
layout: model
title: English RobertaForQuestionAnswering (from akdeniz27)
author: John Snow Labs
name: roberta_qa_roberta_large_cuad
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-cuad` is an English model originally trained by `akdeniz27`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_cuad_en_4.0.0_3.0_1655736445187.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_cuad_en_4.0.0_3.0_1655736445187.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_cuad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_large_cuad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
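Extractive QA models such as this one score every token as a candidate answer start and end, and the answer is the highest-scoring valid span. A toy version of that span search (hypothetical logits, not real model scores):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick (start, end) maximizing start+end score, with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.0, 0.2, 4.0, 0.1, 0.0, 0.0, 0.1, 1.5, 0.0]
end_logits   = [0.0, 0.1, 0.0, 3.5, 0.2, 0.0, 0.1, 0.0, 1.0, 0.2]
s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # Clara
```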
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.cuad.roberta.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_large_cuad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/akdeniz27/roberta-large-cuad
- https://github.com/TheAtticusProject/cuad
- https://github.com/marshmellow77/cuad-demo
---
layout: model
title: French CamemBert Embeddings (from Yanzhu)
author: John Snow Labs
name: camembert_embeddings_bertweetfr_base
date: 2022-05-23
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bertweetfr-base` is a French model originally trained by `Yanzhu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_bertweetfr_base_fr_3.4.4_3.0_1653320961942.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_bertweetfr_base_fr_3.4.4_3.0_1653320961942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_bertweetfr_base","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_bertweetfr_base","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
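The `embeddings` column holds one dense vector per token. A common downstream use is comparing tokens (or pooled sentences) with cosine similarity; this plain-Python sketch uses tiny hypothetical 3-dimensional vectors in place of the model's 768-dimensional CamemBERT outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

v_adore = [0.9, 0.1, 0.3]  # hypothetical embedding for "adore"
v_aime = [0.8, 0.2, 0.4]   # hypothetical embedding for "aime"
print(cosine(v_adore, v_aime))  # close to 1.0 for similar tokens
```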
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_bertweetfr_base|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|415.7 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Yanzhu/bertweetfr-base
---
layout: model
title: Detect Clinical Entities (bert_token_classifier_ner_jsl)
author: John Snow Labs
name: bert_token_classifier_ner_jsl
date: 2021-08-28
tags: [ner, en, licensed, clinical]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.2.0
spark_version: 2.4
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is the BERT-based version of the `ner_jsl` model, and it outperforms the legacy NER model (MedicalNerModel) based on the BiLSTM-CNN-Char architecture.
Definitions of Predicted Entities:
- `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else.
- `Direction`: All the information relating to the laterality of the internal and external organs.
- `Test`: Mentions of laboratory, pathology, and radiological tests.
- `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient.
- `Death_Entity`: Mentions that indicate the death of a patient.
- `Relationship_Status`: State of the patient's romantic or social relationships (e.g. single, married, divorced).
- `Duration`: The duration of a medical treatment or medication use.
- `Respiration`: Number of breaths per minute.
- `Hyperlipidemia`: Terms that indicate hyperlipidemia, with relevant subtypes and synonyms.
- `Birth_Entity`: Mentions that indicate giving birth.
- `Age`: All mentions of ages, past or present, related to the patient or anybody else.
- `Labour_Delivery`: Extractions include stages of labor and delivery.
- `Family_History_Header`: Identifies section headers that correspond to the Family History of the patient.
- `BMI`: Numeric values and other text information related to Body Mass Index.
- `Temperature`: All mentions that refer to body temperature.
- `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else.
- `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic").
- `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else.
- `Medical_History_Header`: Identifies section headers that correspond to Past Medical History of a patient.
- `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events.
- `Oxygen_Therapy`: Breathing support triggered by patient or entirely or partially by machine (e.g. ventilator, BPAP, CPAP).
- `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements.
- `Psychological_Condition`: All the Mental health diagnosis, disorders, conditions or syndromes of a patient or someone else.
- `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases.
- `Employment`: All mentions of patient or provider occupational titles and employment status.
- `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels).
- `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.).
- `Pregnancy`: All terms related to Pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.).
- `ImagingFindings`: All mentions of radiographic and imagistic findings.
- `Procedure`: All mentions of invasive medical or surgical procedures or treatments.
- `Medical_Device`: All mentions related to medical devices and supplies.
- `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups.
- `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels).
- `Symptom`: All the symptoms mentioned in the document, of a patient or someone else.
- `Treatment`: Includes therapeutic and minimally invasive treatment and procedures (invasive treatments or procedures are extracted as "Procedure").
- `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs).
- `Route`: Drug and medication administration routes available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Drug_Ingredient`: Active ingredient/s found in drug products.
- `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted.
- `Diet`: All mentions and information regarding the patient's dietary habits.
- `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye.
- `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein).
- `VS_Finding`: Qualitative data (e.g. Fever, Cyanosis, Tachycardia) and any other symptoms that refers to vital signs.
- `Allergen`: Allergen related extractions mentioned in the document.
- `EKG_Findings`: All mentions of EKG readings.
- `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology.
- `Triglycerides`: All terms related to the specific lab test for triglycerides.
- `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning").
- `Gender`: Gender-specific nouns and pronouns.
- `Pulse`: Peripheral heart rate, without advanced information like measurement location.
- `Social_History_Header`: Identifies section headers that correspond to Social History of a patient.
- `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs).
- `Diabetes`: All terms related to diabetes mellitus.
- `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in ICD-10 name of a specific disease, the respective modifier is not extracted separately.
- `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined by the naked eye.
- `Clinical_Dept`: Terms that indicate the medical and/or surgical departments.
- `Form`: Drug and medication forms available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients.
- `Strength`: Potency of one unit of drug (or a combination of drugs) the measurement units available are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Fetus_NewBorn`: All terms related to fetus, infant, new born (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.).
- `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago").
- `Height`: All mentions related to a patient's height.
- `Test_Result`: Terms related to all the test results present in the document (clinical tests results are included).
- `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity.
- `Frequency`: Frequency of administration for a dose prescribed.
- `Time`: Specific time references (hour and/or minutes).
- `Weight`: All mentions related to a patient's weight.
- `Vaccine`: Generic and brand name of vaccines or vaccination procedure.
- `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient.
- `Communicable_Disease`: Includes all mentions of communicable diseases.
- `Dosage`: Quantity prescribed by the physician for an active ingredient; measurement units are available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately).
- `Hypertension`: All terms related to Hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure).
- `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein).
- `Total_Cholesterol`: Terms related to the lab test and results for cholesterol.
- `Smoking`: All mentions of smoking status of a patient.
- `Date`: Mentions of an exact date, in any format, including day number, month and/or year.
## Predicted Entities
`Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Respiration`, `Hyperlipidemia`, `Birth_Entity`, `Age`, `Labour_Delivery`, `Family_History_Header`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Symptom`, `Treatment`, `Substance`, `Route`, `Drug_Ingredient`, `Blood_Pressure`, `Diet`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Drug_BrandName`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Sexually_Active_or_Sexual_Orientation`, `Frequency`, `Time`, `Weight`, `Vaccine`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Overweight`, `Hypertension`, `HDL`, `Total_Cholesterol`, `Smoking`, `Date`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_BERT_TOKEN_CLASSIFIER/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_en_3.2.0_2.4_1630172634235.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_en_3.2.0_2.4_1630172634235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl", "en", "clinical/models")\
.setInputCols("token", "sentence")\
.setOutputCol("ner")\
.setCaseSensitive(True)
ner_converter = NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
tokenClassifier,
ner_converter])
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
sample_text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . 
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge ."""
result = model.transform(spark.createDataFrame([[sample_text]]).toDF("text"))
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl", "en", "clinical/models")
.setInputCols(Array("token", "sentence"))
.setOutputCol("ner")
.setCaseSensitive(true)
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
tokenClassifier,
ner_converter))
val sample_text = Seq("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . 
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .""").toDS.toDF("text")
val result = pipeline.fit(sample_text).transform(sample_text)
```
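The `NerConverter` stage at the end of the pipeline merges the token-level `B-`/`I-` tags emitted by the classifier into labeled chunks. A minimal plain-Python sketch of that merge (the tokens and tags below are hypothetical):

```python
def bio_to_chunks(tokens, tags):
    """Merge B-/I- token tags into (label, chunk_text) pairs."""
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [tok])  # start a new chunk
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)  # continue the open chunk
        else:  # an "O" tag or inconsistent I- tag closes the open chunk
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(toks)) for label, toks in chunks]

tokens = ["A", "28-year-old", "female", "with", "gestational", "diabetes", "mellitus"]
tags = ["O", "B-Age", "B-Gender", "O", "B-Diabetes", "I-Diabetes", "I-Diabetes"]
print(bio_to_chunks(tokens, tags))
# -> [('Age', '28-year-old'), ('Gender', 'female'), ('Diabetes', 'gestational diabetes mellitus')]
```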
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.ner_jsl").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . 
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
ner_profiling_pipeline = PretrainedPipeline('ner_profiling_biobert', 'en', 'clinical/models')
result = ner_profiling_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_biobert", "en", "clinical/models")
val result = ner_profiling_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.profiling_biobert").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""")
```
## Results
```bash
******************** ner_diseases_biobert Model Results ********************
[('gestational diabetes mellitus', 'Disease'), ('type two diabetes mellitus', 'Disease'), ('T2DM', 'Disease'), ('HTG-induced pancreatitis', 'Disease'), ('hepatitis', 'Disease'), ('obesity', 'Disease'), ('polyuria', 'Disease'), ('polydipsia', 'Disease'), ('poor appetite', 'Disease'), ('vomiting', 'Disease')]
******************** ner_events_biobert Model Results ********************
[('gestational diabetes mellitus', 'PROBLEM'), ('eight years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('three years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index', 'TEST'), ('BMI', 'TEST'), ('presented', 'OCCURRENCE'), ('a one-week', 'DURATION'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')]
******************** ner_jsl_biobert Model Results ********************
[('28-year-old', 'Age'), ('female', 'Gender'), ('gestational diabetes mellitus', 'Diabetes'), ('eight years prior', 'RelativeDate'), ('type two diabetes mellitus', 'Diabetes'), ('T2DM', 'Disease_Syndrome_Disorder'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('three years prior', 'RelativeDate'), ('acute', 'Modifier'), ('hepatitis', 'Disease_Syndrome_Disorder'), ('obesity', 'Obesity'), ('body mass index', 'BMI'), ('BMI ) of 33.5 kg/m2', 'BMI'), ('one-week', 'Duration'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')]
******************** ner_clinical_biobert Model Results ********************
[('gestational diabetes mellitus', 'PROBLEM'), ('subsequent type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index ( BMI )', 'TEST'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')]
******************** ner_risk_factors_biobert Model Results ********************
[('diabetes mellitus', 'DIABETES'), ('subsequent type two diabetes mellitus', 'DIABETES'), ('obesity', 'OBESE')]
...
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_profiling_biobert|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.3.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|750.1 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- Finisher
---
layout: model
title: English DistilBertForQuestionAnswering Base Cased model (from rahulchakwate)
author: John Snow Labs
name: distilbert_qa_base_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-finetuned-squad` is an English model originally trained by `rahulchakwate`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_finetuned_squad_en_4.3.0_3.0_1672767084899.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_finetuned_squad_en_4.3.0_3.0_1672767084899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
question_answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[document_assembler, question_answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val questionAnswering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, questionAnswering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/rahulchakwate/distilbert-base-finetuned-squad
---
layout: model
title: English RobertaForSequenceClassification Cased model (from lucianpopa)
author: John Snow Labs
name: roberta_classifier_autonlp_trec_classification_522314623
date: 2022-12-09
tags: [en, open_source, roberta, sequence_classification, classification]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-TREC-classification-522314623` is an English model originally trained by `lucianpopa`.
## Predicted Entities
`1`, `0`, `4`, `2`, `3`, `5`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_autonlp_trec_classification_522314623_en_4.2.4_3.0_1670622157916.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_autonlp_trec_classification_522314623_en_4.2.4_3.0_1670622157916.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autonlp_trec_classification_522314623","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier])
data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autonlp_trec_classification_522314623","en")
.setInputCols(Array("document", "token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier))
val data = Seq("I love you!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
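Once the pipeline has run, a `LightPipeline.annotate()` call typically returns one dict per input text. As a minimal Spark-free sketch (assuming each dict carries the predicted label under a one-element `class` list, e.g. `{'class': ['3']}` — the shape is illustrative), the batch predictions can be tallied like this:

```python
from collections import Counter

def tally_predictions(annotations):
    """Count predicted class labels across a batch of annotate() results.

    Each element is assumed to be a dict whose 'class' key holds a
    one-element list of label strings, e.g. {'class': ['3']}.
    """
    return Counter(ann["class"][0] for ann in annotations if ann.get("class"))

# Hypothetical batch of LightPipeline.annotate() outputs:
batch = [{"class": ["3"]}, {"class": ["3"]}, {"class": ["5"]}]
print(tally_predictions(batch))  # Counter({'3': 2, '5': 1})
```

The same pattern works for any single-label classifier card in this catalog; only the key name changes if a different output column is configured.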
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_classifier_autonlp_trec_classification_522314623|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/lucianpopa/autonlp-TREC-classification-522314623
---
layout: model
title: Basic General Purpose Pipeline for Catalan
author: cayorodriguez
name: pipeline_md
date: 2022-07-11
tags: [ca, open_source]
task: [Named Entity Recognition, Sentence Detection, Embeddings, Stop Words Removal, Part of Speech Tagging, Lemmatization, Chunk Mapping, Pipeline Public]
language: ca
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: false
recommended: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Model for Catalan language processing based on models by the Barcelona Supercomputing Center and the AINA project (Generalitat de Catalunya), following the POS and tokenization guidelines of the AnCora Universal Dependencies corpus.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon}
[Download](https://s3.amazonaws.com/community.johnsnowlabs.com/cayorodriguez/pipeline_md_ca_3.4.4_3.0_1657533114488.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://community.johnsnowlabs.com/cayorodriguez/pipeline_md_ca_3.4.4_3.0_1657533114488.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("pipeline_md", "ca", "@cayorodriguez")
result = pipeline.annotate("El català ja és a SparkNLP.")
```
## Results
```bash
{'chunk': ['El català ja', 'SparkNLP', 'és'],
'entities': ['SparkNLP'],
'lemma': ['el', 'català', 'ja', 'ser', 'a', 'sparknlp', '.'],
'document': ['El català ja és a SparkNLP.'],
'pos': ['DET', 'NOUN', 'ADV', 'AUX', 'ADP', 'PROPN', 'PUNCT'],
'sentence_embeddings': ['El català ja és a SparkNLP.'],
'cleanTokens': ['català', 'SparkNLP', '.'],
'token': ['El', 'català', 'ja', 'és', 'a', 'SparkNLP', '.'],
'ner': ['O', 'O', 'O', 'O', 'O', 'B-ORG', 'O'],
'embeddings': ['El', 'català', 'ja', 'és', 'a', 'SparkNLP', '.'],
'form': ['el', 'català', 'ja', 'és', 'a', 'sparknlp', '.'],
'sentence': ['El català ja és a SparkNLP.']}
```
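The `token`, `lemma`, `pos`, and `ner` lists in the output above are index-aligned, so they can be zipped into one row per token for easier inspection. A minimal sketch using the exact dict shown above:

```python
# annotate() output from the example above (lists are index-aligned)
result = {
    "token": ["El", "català", "ja", "és", "a", "SparkNLP", "."],
    "lemma": ["el", "català", "ja", "ser", "a", "sparknlp", "."],
    "pos": ["DET", "NOUN", "ADV", "AUX", "ADP", "PROPN", "PUNCT"],
    "ner": ["O", "O", "O", "O", "O", "B-ORG", "O"],
}

# One (token, lemma, pos, ner) tuple per token
rows = list(zip(result["token"], result["lemma"], result["pos"], result["ner"]))
for token, lemma, pos, ner in rows:
    print(f"{token:10} {lemma:10} {pos:6} {ner}")
```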
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_md|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Community|
|Language:|ca|
|Size:|756.1 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- NormalizerModel
- StopWordsCleaner
- RoBertaEmbeddings
- SentenceEmbeddings
- EmbeddingsFinisher
- LemmatizerModel
- PerceptronModel
- RoBertaForTokenClassification
- NerConverter
- Chunker
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from huangtuoyue)
author: John Snow Labs
name: distilbert_qa_huangtuoyue_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `huangtuoyue`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_huangtuoyue_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771314379.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_huangtuoyue_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771314379.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_huangtuoyue_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_huangtuoyue_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_huangtuoyue_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/huangtuoyue/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Fast Neural Machine Translation Model from Belarusian to Spanish
author: John Snow Labs
name: opus_mt_be_es
date: 2021-06-01
tags: [open_source, seq2seq, translation, be, es, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
source languages: be
target languages: es
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_be_es_xx_3.1.0_2.4_1622557108215.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_be_es_xx_3.1.0_2.4_1622557108215.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_be_es", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
data = spark.createDataFrame([["Your sentence to translate!"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_be_es", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Belarusian.translate_to.Spanish').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_be_es|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Korean Bert Embeddings (from deeq)
author: John Snow Labs
name: bert_embeddings_dbert
date: 2022-04-11
tags: [bert, embeddings, ko, open_source]
task: Embeddings
language: ko
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `dbert` is a Korean model originally trained by `deeq`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_dbert_ko_3.4.2_3.0_1649675591519.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_dbert_ko_3.4.2_3.0_1649675591519.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_dbert","ko") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_dbert","ko")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ko.embed.dbert").predict("""나는 Spark NLP를 좋아합니다""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_dbert|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ko|
|Size:|424.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/deeq/dbert
---
layout: model
title: German T5ForConditionalGeneration Base Cased model (from Einmalumdiewelt)
author: John Snow Labs
name: t5_base_gnad
date: 2023-01-30
tags: [de, open_source, t5, tensorflow]
task: Text Generation
language: de
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `T5-Base_GNAD` is a German model originally trained by `Einmalumdiewelt`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_gnad_de_4.3.0_3.0_1675099176903.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_gnad_de_4.3.0_3.0_1675099176903.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_base_gnad","de") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_base_gnad","de")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_base_gnad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|de|
|Size:|919.7 MB|
## References
- https://huggingface.co/Einmalumdiewelt/T5-Base_GNAD
---
layout: model
title: Spanish Lemmatizer
author: John Snow Labs
name: lemma
date: 2020-02-17 00:16:00 +0800
task: Lemmatization
language: es
edition: Spark NLP 2.4.0
spark_version: 2.4
tags: [lemmatizer, es]
supported: true
annotator: LemmatizerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous.
{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_es_2.4.0_2.4_1581890818386.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_es_2.4.0_2.4_1581890818386.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
lemmatizer = LemmatizerModel.pretrained("lemma", "es") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Además de ser el rey del norte, John Snow es un médico inglés y líder en el desarrollo de la anestesia y la higiene médica.")
```
```scala
...
val lemmatizer = LemmatizerModel.pretrained("lemma", "es")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer))
val data = Seq("Además de ser el rey del norte, John Snow es un médico inglés y líder en el desarrollo de la anestesia y la higiene médica.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Además de ser el rey del norte, John Snow es un médico inglés y líder en el desarrollo de la anestesia y la higiene médica."""]
lemma_df = nlu.load('es.lemma').predict(text, output_level = "token")
lemma_df.lemma.values[0]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=5, result='Además', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=7, end=8, result='de', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=10, end=12, result='ser', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=14, end=15, result='el', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=17, end=19, result='rey', metadata={'sentence': '0'}, embeddings=[]),
...]
```
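Each `Row` above is a Spark NLP annotation whose `result` field carries the lemma string. As a minimal stand-in sketch (the `Annotation` namedtuple below is illustrative; the real objects returned by `fullAnnotate` expose the same `result` field), the plain lemma strings can be collected like this:

```python
from collections import namedtuple

# Illustrative stand-in for Spark NLP's annotation objects, mirroring the
# fields shown in the Row output above.
Annotation = namedtuple("Annotation", ["annotatorType", "begin", "end", "result"])

annotations = [
    Annotation("token", 0, 5, "Además"),
    Annotation("token", 7, 8, "de"),
    Annotation("token", 10, 12, "ser"),
    Annotation("token", 14, 15, "el"),
    Annotation("token", 17, 19, "rey"),
]

# Collect just the lemma strings, as one would from fullAnnotate()['lemma']
lemmas = [a.result for a in annotations]
print(lemmas)  # ['Además', 'de', 'ser', 'el', 'rey']
```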
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma|
|Type:|lemmatizer|
|Compatibility:|Spark NLP 2.4.0+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[lemma]|
|Language:|es|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: English BertForQuestionAnswering Cased model (from motiondew)
author: John Snow Labs
name: bert_qa_set_date_2_lr_2e_5_bs_32_ep_4
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-set_date_2-lr-2e-5-bs-32-ep-4` is an English model originally trained by `motiondew`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_2_lr_2e_5_bs_32_ep_4_en_4.0.0_3.0_1657188461900.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_2_lr_2e_5_bs_32_ep_4_en_4.0.0_3.0_1657188461900.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_2_lr_2e_5_bs_32_ep_4","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_2_lr_2e_5_bs_32_ep_4","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_set_date_2_lr_2e_5_bs_32_ep_4|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/motiondew/bert-set_date_2-lr-2e-5-bs-32-ep-4
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18)
author: John Snow Labs
name: distilbert_qa_base_uncased_becasv2_1
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becasv2-1` is an English model originally trained by `Evelyn18`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_1_en_4.3.0_3.0_1672767657809.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_1_en_4.3.0_3.0_1672767657809.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_1","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_becasv2_1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Evelyn18/distilbert-base-uncased-becasv2-1
---
layout: model
title: Translate English to Turkic languages Pipeline
author: John Snow Labs
name: translate_en_trk
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, trk, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this module is very computationally expensive, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `trk`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_trk_xx_2.7.0_2.4_1609690153382.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_trk_xx_2.7.0_2.4_1609690153382.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_trk", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_trk", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.trk').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_trk|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: German asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295 TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295` is a German model originally trained by jonatasgrosman.
NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295_de_4.2.0_3.0_1664114711185.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295_de_4.2.0_3.0_1664114711185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295', lang = 'de')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295", lang = "de")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|de|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English BertForQuestionAnswering model (from MarcBrun)
author: John Snow Labs
name: bert_qa_ixambert_finetuned_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ixambert-finetuned-squad` is an English model originally trained by `MarcBrun`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_ixambert_finetuned_squad_en_4.0.0_3.0_1654187989433.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_ixambert_finetuned_squad_en_4.0.0_3.0_1654187989433.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ixambert_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_ixambert_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.ixam_bert.by_MarcBrun").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
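The nlu snippet above joins the question and context into a single string with a `|||` separator. A minimal helper for building such inputs from question/context pairs (the function name is illustrative; the separator convention is taken from the snippet above):

```python
def to_nlu_qa_input(question, context, sep="|||"):
    """Join a question/context pair with the '|||' separator used in the
    nlu question-answering example above (helper name is illustrative)."""
    return f"{question}{sep}{context}"

# Batch of question/context pairs to feed into nlu.load(...).predict(...)
pairs = [
    ("What's my name?", "My name is Clara and I live in Berkeley."),
    ("Where do I live?", "My name is Clara and I live in Berkeley."),
]
inputs = [to_nlu_qa_input(q, c) for q, c in pairs]
print(inputs[0])  # What's my name?|||My name is Clara and I live in Berkeley.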
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_ixambert_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|661.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/MarcBrun/ixambert-finetuned-squad
---
layout: model
title: English BertForQuestionAnswering Cased model (from akmal2500)
author: John Snow Labs
name: bert_qa_akmal2500_finetuned_squad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `akmal2500`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_akmal2500_finetuned_squad_en_4.0.0_3.0_1657186331020.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_akmal2500_finetuned_squad_en_4.0.0_3.0_1657186331020.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_akmal2500_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_akmal2500_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_akmal2500_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/akmal2500/bert-finetuned-squad
---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_0
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-64-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_0_en_4.0.0_3.0_1655733244827.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_0_en_4.0.0_3.0_1655733244827.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_64d_seed_0").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|419.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-64-finetuned-squad-seed-0
---
layout: model
title: Legal Qualifications Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_qualifications_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, qualifications, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Qualifications` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline would make the model see only sentences, not the whole text, so it is better to skip it unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a True/False output for each clause model you add.
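The splitting advice above can be sketched in plain Python. This is only an illustration: the helper name and the whitespace-based token count are assumptions, and the model's own tokenizer may count tokens differently.

```python
# Sketch of "paragraph splitting (by multiline)": break a long contract into
# provision-sized chunks on blank lines so each chunk fits the 512-token
# window. Token counting here is a rough whitespace split, not the model's
# real tokenizer.
def split_into_provisions(text: str, max_tokens: int = 512):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for paragraph in paragraphs:
        words = paragraph.split()
        if len(words) <= max_tokens:
            chunks.append(paragraph)
        else:
            # Oversized paragraph: fall back to fixed-size word windows.
            for i in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[i:i + max_tokens]))
    return chunks

sample = ("Section 1. Qualifications.\nThe Borrower is duly qualified to do business.\n\n"
          "Section 2. Notices.\nAll notices shall be in writing.")
print(split_into_provisions(sample))
```

Each resulting chunk can then be sent to the classifier as a separate row.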
## Predicted Entities
`Qualifications`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_qualifications_bert_en_1.0.0_3.0_1678049907588.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_qualifications_bert_en_1.0.0_3.0_1678049907588.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
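The usage snippet is missing from this card; the sketch below follows the pattern used by other Legal NLP binary clause classifiers. Treat it as a sketch under stated assumptions: the `sent_bert_base_cased` embeddings stage is inferred from similar cards (this card only guarantees a `sentence_embeddings` input and a `class` output), and running it requires the licensed `johnsnowlabs` library.

```python
# Hypothetical usage sketch -- the embeddings model name is an assumption.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_qualifications_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```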
## Results
```bash
+-------+
|result|
+-------+
|[Qualifications]|
|[Other]|
|[Other]|
|[Qualifications]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_qualifications_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.8 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 0.83 0.83 0.83 6
Qualifications 0.83 0.83 0.83 6
accuracy - - 0.83 12
macro-avg 0.83 0.83 0.83 12
weighted-avg 0.83 0.83 0.83 12
```
---
layout: model
title: Resolver Company Names to Tickers using Nasdaq Stock Screener
author: John Snow Labs
name: finel_nasdaq_ticker_stock_screener
date: 2023-01-20
tags: [en, licensed, finance, nasdaq, ticker]
task: Entity Resolution
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is an Entity Resolution / Entity Linking model that returns the Ticker / Trading Symbol for a given Company Name. You can use any NER model which extracts Organizations / Companies / Parties, then send its output to the `finel_nasdaq_company_name_stock_screener` model to get a normalized company name. Finally, this Entity Linking model returns the Ticker / Trading Symbol (given the company has one).
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finel_nasdaq_ticker_stock_screener_en_1.0.0_3.0_1674236954508.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finel_nasdaq_ticker_stock_screener_en_1.0.0_3.0_1674236954508.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
tokenizer = nlp.Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
ner_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
.setInputCols(["document", "token"])\
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained('finner_orgs_prods_alias', 'en', 'finance/models')\
.setInputCols(["document", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
chunkToDoc = nlp.Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
ticker_embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en")\
.setInputCols("ner_chunk_doc")\
.setOutputCol("ticker_embeddings")
er_ticker_model = finance.SentenceEntityResolverModel.pretrained('finel_nasdaq_ticker_stock_screener', 'en', 'finance/models')\
.setInputCols(["ticker_embeddings"])\
.setOutputCol("ticker")\
.setAuxLabelCol("company_name")
pipeline = nlp.Pipeline().setStages([document_assembler,
tokenizer,
ner_embeddings,
ner_model,
ner_converter,
chunkToDoc,
ticker_embeddings,
er_ticker_model])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(empty_data)
lp = nlp.LightPipeline(model)
text = """Nike is an American multinational association that is involved in the design, development, manufacturing and worldwide marketing and sales of apparel, footwear, accessories, equipment and services."""
result = lp.annotate(text)
result["ticker"]
```
## Results
```bash
['NKE']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finel_nasdaq_ticker_stock_screener|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[normalized]|
|Language:|en|
|Size:|54.6 MB|
|Case sensitive:|false|
## References
https://www.nasdaq.com/market-activity/stocks/screener
---
layout: model
title: Korean Electra Embeddings (from krevas)
author: John Snow Labs
name: electra_embeddings_finance_koelectra_base_generator
date: 2022-05-17
tags: [ko, open_source, electra, embeddings]
task: Embeddings
language: ko
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finance-koelectra-base-generator` is a Korean model originally trained by `krevas`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_finance_koelectra_base_generator_ko_3.4.4_3.0_1652786802248.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_finance_koelectra_base_generator_ko_3.4.4_3.0_1652786802248.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("electra_embeddings_finance_koelectra_base_generator","ko") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("electra_embeddings_finance_koelectra_base_generator","ko")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_embeddings_finance_koelectra_base_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|ko|
|Size:|129.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/krevas/finance-koelectra-base-generator
- https://openreview.net/forum?id=r1xMH1BtvB
- https://github.com/google-research/electra
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_test TFWav2Vec2ForCTC from ying-tina
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab_test
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_test` is an English model originally trained by ying-tina.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_wav2vec2_base_timit_demo_colab_test_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_test_en_4.2.0_3.0_1664111957682.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_test_en_4.2.0_3.0_1664111957682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_timit_demo_colab_test", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_timit_demo_colab_test", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab_test|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|354.9 MB|
---
layout: model
title: Sinhala RobertaForMaskedLM Cased model (from keshan)
author: John Snow Labs
name: roberta_embeddings_sinhalaberto
date: 2022-12-12
tags: [si, open_source, roberta_embeddings, robertaformaskedlm]
task: Embeddings
language: si
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `SinhalaBERTo` is a Sinhala model originally trained by `keshan`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_sinhalaberto_si_4.2.4_3.0_1670858534410.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_sinhalaberto_si_4.2.4_3.0_1670858534410.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_sinhalaberto","si") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_sinhalaberto","si")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_sinhalaberto|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|si|
|Size:|314.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/keshan/SinhalaBERTo
- https://oscar-corpus.com/
- https://arxiv.org/abs/1907.11692
---
layout: model
title: Fast Neural Machine Translation Model from Baltic Languages to English
author: John Snow Labs
name: opus_mt_bat_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, bat, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `bat`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bat_en_xx_2.7.0_2.4_1609164652074.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bat_en_xx_2.7.0_2.4_1609164652074.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_bat_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = "Text to translate"
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_bat_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.bat.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_bat_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Clinical Analysis
author: John Snow Labs
name: clinical_analysis
class: PipelineModel
language: en
nav_key: models
repository: clinical/models
date: 2020-02-01
task: Pipeline Healthcare
edition: Healthcare NLP 2.4.0
spark_version: 2.4
tags: [clinical,licensed,pipeline,en]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_analysis_en_2.4.0_2.4_1580600773378.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_analysis_en_2.4.0_2.4_1580600773378.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPython.html %}
```python
model = PretrainedPipeline("clinical_analysis","en","clinical/models")
```
```scala
val model = PipelineModel.pretrained("clinical_analysis","en","clinical/models")
```
{:.model-param}
## Model Information
{:.table-model}
|---------------|-------------------|
| Name: | clinical_analysis |
| Type: | PipelineModel |
| Compatibility: | Spark NLP 2.4.0+ |
| License: | Licensed |
| Edition: | Official |
| Language: | en |
{:.h2_title}
## Data Source
---
layout: model
title: English asr_Fine_Tunning_on_CV_dataset TFWav2Vec2ForCTC from Sania67
author: John Snow Labs
name: pipeline_asr_Fine_Tunning_on_CV_dataset
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Fine_Tunning_on_CV_dataset` is an English model originally trained by Sania67.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use `pipeline_asr_Fine_Tunning_on_CV_dataset_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_Fine_Tunning_on_CV_dataset_en_4.2.0_3.0_1664118340149.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_Fine_Tunning_on_CV_dataset_en_4.2.0_3.0_1664118340149.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_Fine_Tunning_on_CV_dataset', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_Fine_Tunning_on_CV_dataset", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_Fine_Tunning_on_CV_dataset|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Legal Nonstatutory Stock Option Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_nonstatutory_stock_option_agreement_bert
date: 2023-02-02
tags: [en, legal, classification, nonstatutory, stock, option, agreement, licensed, bert, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_nonstatutory_stock_option_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to determine whether a document belongs to the class `nonstatutory-stock-option-agreement` or not (Binary Classification).
Unlike the Longformer-based model, this model is lighter in terms of inference time.
## Predicted Entities
`nonstatutory-stock-option-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_nonstatutory_stock_option_agreement_bert_en_1.0.0_3.0_1675360953793.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_nonstatutory_stock_option_agreement_bert_en_1.0.0_3.0_1675360953793.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
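As with other Legal NLP document classifiers, this card's usage snippet is missing; a minimal sketch under the same assumptions as similar cards (an inferred `sent_bert_base_cased` embeddings stage, plus the licensed `johnsnowlabs` library) would be:

```python
# Hypothetical usage sketch -- the embeddings model name is an assumption.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_nonstatutory_stock_option_agreement_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```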
## Results
```bash
+-------+
|result|
+-------+
|[nonstatutory-stock-option-agreement]|
|[other]|
|[other]|
|[nonstatutory-stock-option-agreement]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_nonstatutory_stock_option_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house + SEC documents
## Benchmarking
```bash
label precision recall f1-score support
nonstatutory-stock-option-agreement 0.98 0.96 0.97 53
other 0.98 0.99 0.99 122
accuracy - - 0.98 175
macro-avg 0.98 0.98 0.98 175
weighted-avg 0.98 0.98 0.98 175
```
---
layout: model
title: Chinese BertForMaskedLM Cased model (from hfl)
author: John Snow Labs
name: bert_embeddings_rbtl3
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rbtl3` is a Chinese model originally trained by `hfl`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbtl3_zh_4.2.4_3.0_1670022905492.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbtl3_zh_4.2.4_3.0_1670022905492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbtl3","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbtl3","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_rbtl3|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|228.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/hfl/rbtl3
- https://arxiv.org/abs/1906.08101
- https://github.com/google-research/bert
- https://github.com/ymcui/Chinese-BERT-wwm
- https://github.com/ymcui/MacBERT
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/ymcui/HFL-Anthology
- https://arxiv.org/abs/2004.13922
---
layout: model
title: Finnish T5ForConditionalGeneration Tiny Cased model (from Finnish-NLP)
author: John Snow Labs
name: t5_tiny_nl6
date: 2023-01-31
tags: [fi, open_source, t5, tensorflow]
task: Text Generation
language: fi
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-tiny-nl6-finnish` is a Finnish model originally trained by `Finnish-NLP`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_tiny_nl6_fi_4.3.0_3.0_1675156113232.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_tiny_nl6_fi_4.3.0_3.0_1675156113232.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_tiny_nl6","fi") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_tiny_nl6","fi")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_tiny_nl6|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|fi|
|Size:|145.8 MB|
## References
- https://huggingface.co/Finnish-NLP/t5-tiny-nl6-finnish
- https://arxiv.org/abs/1910.10683
- https://github.com/google-research/text-to-text-transfer-transformer
- https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511
- https://arxiv.org/abs/2002.05202
- https://arxiv.org/abs/2109.10686
- http://urn.fi/urn:nbn:fi:lb-2017070501
- http://urn.fi/urn:nbn:fi:lb-2021050401
- http://urn.fi/urn:nbn:fi:lb-2018121001
- http://urn.fi/urn:nbn:fi:lb-2020021803
- https://sites.research.google/trc/about/
- https://github.com/google-research/t5x
- https://github.com/spyysalo/yle-corpus
- https://github.com/aajanki/eduskunta-vkk
- https://sites.research.google/trc/
- https://www.linkedin.com/in/aapotanskanen/
- https://www.linkedin.com/in/rasmustoivanen/
---
layout: model
title: Fast Neural Machine Translation Model from English to Luvale
author: John Snow Labs
name: opus_mt_en_lue
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, lue, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `lue`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_lue_xx_2.7.0_2.4_1609168026109.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_lue_xx_2.7.0_2.4_1609168026109.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_lue", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_lue", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.lue').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_lue|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Sentence Entity Resolver for SNOMED codes (procedures and measurements)
author: John Snow Labs
name: sbiobertresolve_clinical_snomed_procedures_measurements
date: 2021-11-15
tags: [en, licensed, clinical, entity_resolution]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.2
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps medical entities to SNOMED codes using `sent_biobert_clinical_base_cased` Sentence Bert Embeddings. The corpus of this model includes `Procedures` and `Measurement` domains.
## Predicted Entities
`SNOMED` codes from `Procedures` and `Measurements`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_clinical_snomed_procedures_measurements_en_3.3.2_3.0_1636985738813.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_clinical_snomed_procedures_measurements_en_3.3.2_3.0_1636985738813.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
| | chunk | code | code_description | all_k_code_desc | all_k_codes |
|---:|:-----------------------|----------:|:------------------------------|:--------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 | coronary calcium score | 450360000 | Coronary artery calcium score | ['450360000', '450734004', '1086491000000104', '1086481000000101', '762241007'] | ['Coronary artery calcium score', 'Coronary artery calcium score', 'Dundee Coronary Risk Disk score', 'Dundee Coronary Risk rank', 'Dundee Coronary Risk Disk'] |
| 1 | heart surgery | 2598006 | Open heart surgery | ['2598006', '64915003', '119766003', '34068001', '233004008'] | ['Open heart surgery', 'Operation on heart', 'Heart reconstruction', 'Heart valve replacement', 'Coronary sinus operation'] |
| 2 | ct scan | 303653007 | CT of head | ['303653007', '431864000', '363023007', '418272005', '241577003'] | ['CT of head', 'CT guided injection', 'CT of site', 'CT angiography', 'CT of spine'] |
| 3 | bp value | 75367002 | Blood pressure | ['75367002', '6797001', '723232008', '46973005', '427732000'] | ['Blood pressure', 'Mean blood pressure', 'Average blood pressure', 'Blood pressure taking', 'Speed of blood pressure response'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_clinical_snomed_procedures_measurements|
|Compatibility:|Healthcare NLP 3.3.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_chunk_embeddings]|
|Output Labels:|[output]|
|Language:|en|
|Case sensitive:|false|
## Data Source
Trained on `SNOMED` code dataset with `sent_biobert_clinical_base_cased` sentence embeddings.
---
layout: model
title: Legal Representations And Warranties Clause Binary Classifier
author: John Snow Labs
name: legclf_representations_and_warranties_clause
date: 2022-12-18
tags: [en, legal, classification, licensed, clause, bert, representations, and, warranties, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the representations-and-warranties clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`representations-and-warranties`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_representations_and_warranties_clause_en_1.0.0_3.0_1671393649249.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_representations_and_warranties_clause_en_1.0.0_3.0_1671393649249.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+--------------------------------+
|result                          |
+--------------------------------+
|[representations-and-warranties]|
|[other]                         |
|[other]                         |
|[representations-and-warranties]|
+--------------------------------+
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_representations_and_warranties_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
other 0.92 0.90 0.91 39
representations-and-warranties 0.86 0.89 0.88 28
accuracy - - 0.90 67
macro-avg 0.89 0.90 0.89 67
weighted-avg 0.90 0.90 0.90 67
```
---
layout: model
title: Hindi BertForMaskedLM Cased model (from neuralspace-reverie)
author: John Snow Labs
name: bert_embeddings_indic_transformers
date: 2022-12-06
tags: [hi, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: hi
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-hi-bert` is a Hindi model originally trained by `neuralspace-reverie`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_hi_4.2.4_3.0_1670326624051.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_hi_4.2.4_3.0_1670326624051.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","hi") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","hi")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_indic_transformers|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|hi|
|Size:|612.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/neuralspace-reverie/indic-transformers-hi-bert
- https://oscar-corpus.com/
---
layout: model
title: Italian T5ForConditionalGeneration Small Cased model (from it5)
author: John Snow Labs
name: t5_it5_efficient_small_el32_repubblica_to_ilgiornale
date: 2023-01-30
tags: [it, open_source, t5, tensorflow]
task: Text Generation
language: it
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-efficient-small-el32-repubblica-to-ilgiornale` is an Italian model originally trained by `it5`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_repubblica_to_ilgiornale_it_4.3.0_3.0_1675103650043.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_repubblica_to_ilgiornale_it_4.3.0_3.0_1675103650043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_repubblica_to_ilgiornale","it") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_repubblica_to_ilgiornale","it")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_it5_efficient_small_el32_repubblica_to_ilgiornale|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|it|
|Size:|594.0 MB|
## References
- https://huggingface.co/it5/it5-efficient-small-el32-repubblica-to-ilgiornale
- https://github.com/stefan-it
- https://arxiv.org/abs/2203.03759
- https://gsarti.com
- https://malvinanissim.github.io
- https://arxiv.org/abs/2109.10686
- https://github.com/gsarti/it5
- https://paperswithcode.com/sota?task=Headline+style+transfer+%28Repubblica+to+Il+Giornale%29&dataset=CHANGE-IT
---
layout: model
title: Sentence Entity Resolver for NDC (sbiobert_base_cased_mli embeddings)
author: John Snow Labs
name: sbiobertresolve_ndc
date: 2021-11-27
tags: [ndc, entity_resolution, licensed, en, clinical]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.2
spark_version: 2.4
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities and concepts (like drugs/ingredients) to [National Drug Codes](https://www.fda.gov/drugs/drug-approvals-and-databases/national-drug-code-directory) using `sbiobert_base_cased_mli` Sentence Bert Embeddings. Also, if a drug has more than one NDC code, it returns all other codes in the all_k_aux_label column separated by `|` symbol.
## Predicted Entities
`NDC Codes`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_NDC/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_ndc_en_3.3.2_2.4_1638010818380.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_ndc_en_3.3.2_2.4_1638010818380.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The ```sbiobertresolve_ndc``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings and ```ner_posology_greedy``` as the NER model, with ```DRUG``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
c2doc = Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sentence_embeddings")
ndc_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_ndc", "en", "clinical/models") \
.setInputCols(["ner_chunk", "sentence_embeddings"]) \
.setOutputCol("ndc_code")\
.setDistanceFunction("EUCLIDEAN")\
.setCaseSensitive(False)
resolver_pipeline = Pipeline(
stages = [
document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
posology_ner,
ner_converter_icd,
c2doc,
sbert_embedder,
ndc_resolver
])
data = spark.createDataFrame([["""The patient was transferred secondary to inability and continue of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given aspirin 81 mg, folic acid 1 g daily, insulin glargine 100 UNT/ML injection and metformin 500 mg p.o. p.r.n."""]]).toDF("text")
result = resolver_pipeline.fit(data).transform(data)
```
```scala
...
val c2doc = new Chunk2Doc()
.setInputCols("ner_chunk")
.setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli", "en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sentence_embeddings")
val ndc_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_ndc", "en", "clinical/models")
.setInputCols(Array("ner_chunk", "sentence_embeddings"))
.setOutputCol("ndc_code")
.setDistanceFunction("EUCLIDEAN")
.setCaseSensitive(false)
val resolver_pipeline = new Pipeline().setStages(Array(
document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
posology_ner,
ner_converter_icd,
c2doc,
sbert_embedder,
ndc_resolver
))
val clinical_note = Seq("""The patient was transferred secondary to inability and continue of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given aspirin 81 mg, folic acid 1 g daily, insulin glargine 100 UNT/ML injection and metformin 500 mg p.o. p.r.n.""").toDS.toDF("text")
val result = resolver_pipeline.fit(clinical_note).transform(clinical_note)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.ndc").predict("""The patient was transferred secondary to inability and continue of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given aspirin 81 mg, folic acid 1 g daily, insulin glargine 100 UNT/ML injection and metformin 500 mg p.o. p.r.n.""")
```
## Results
```bash
+-------------------------------------+------+-----------+------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ner_chunk|entity| ndc_code| description| all_codes| all_resolutions| other ndc codes|
+-------------------------------------+------+-----------+------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| aspirin 81 mg| DRUG|73089008114| aspirin 81 mg/81mg, 81 mg in 1 carton , capsule|[73089008114, 71872708704, 71872715401, 68210101500, 69536028110, 63548086706, 71679001000, 68196090051, 00113400500, 69536018112, 73089008112, 63981056362, 63739043402, 63548086705, 00113046708, 7...|[aspirin 81 mg/81mg, 81 mg in 1 carton , capsule, aspirin 81 mg 81 mg/1, 4 blister pack in 1 bag , tablet, aspirin 81 mg/1, 1 blister pack in 1 bag , tablet, coated, aspirin 81 mg/1, 1 bag in 1 dru...| [-, -, -, -, -, -, -, -, -, -, -, 63940060962, -, -, -, -, -, -, -, -, 70000042002|00363021879|41250027408|36800046708|59779027408|49035027408|71476010131|81522046708|30142046708, -, -, -, -]|
| folic acid 1 g| DRUG|43744015101| folic acid 1 g/g, 1 g in 1 package , powder|[43744015101, 63238340000, 66326050555, 51552041802, 51552041805, 63238340001, 81919000204, 51552041804, 66326050556, 51552106301, 51927003300, 71092997701, 51927296300, 51552146602, 61281900002, 6...|[folic acid 1 g/g, 1 g in 1 package , powder, folic acid 1 kg/kg, 1 kg in 1 bottle , powder, folic acid 1 kg/kg, 1 kg in 1 drum , powder, folic acid 1 g/g, 5 g in 1 container , powder, folic acid 1...| [-, -, -, -, -, -, -, -, -, -, -, 51552139201, -, -, -, 81919000203, -, 81919000201, -, -, -, -, -, -, -]|
|insulin glargine 100 UNT/ML injection| DRUG|00088502101|insulin glargine 100 [iu]/ml, 1 vial, glass in 1 package , injection, solution|[00088502101, 00088222033, 49502019580, 00002771563, 00169320111, 00088250033, 70518139000, 00169266211, 50090127600, 50090407400, 00002771559, 00002772899, 70518225200, 70518138800, 00024592410, 0...|[insulin glargine 100 [iu]/ml, 1 vial, glass in 1 package , injection, solution, insulin glargine 100 [iu]/ml, 1 vial, glass in 1 carton , injection, solution, insulin glargine 100 [iu]/ml, 1 vial ...|[-, -, -, 00088221900, -, -, 50090139800|00088502005, -, 70518146200|00169368712, 00169368512|73070020011, 00088221905|49502019675|50090406800, -, 73070010011|00169750111|50090495500, 66733077301|0...|
| metformin 500 mg| DRUG|70010006315| metformin hydrochloride 500 mg/500mg, 500 mg in 1 drum , tablet|[70010006315, 62207041613, 71052050750, 62207049147, 71052091050, 25000010197, 25000013498, 25000010198, 71052063005, 51662139201, 70010049118, 70882012456, 71052011005, 71052065905, 71052050850, 1...|[metformin hydrochloride 500 mg/500mg, 500 mg in 1 drum , tablet, metformin hcl 500 mg/kg, 50 kg in 1 drum , powder, 5-fluorouracil 500 g/500g, 500 g in 1 container , powder, metformin er 500 mg 50...| [-, -, -, 70010049105, -, -, -, -, -, -, -, -, -, -, -, 71800000801|42571036007, -, -, -, -, -, -, -, -, -]|
+-------------------------------------+------+-----------+------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_ndc|
|Compatibility:|Healthcare NLP 3.3.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[ndc_code]|
|Language:|en|
|Case sensitive:|false|
---
layout: model
title: English asr_sanskrit TFWav2Vec2ForCTC from Tarakki100
author: John Snow Labs
name: pipeline_asr_sanskrit
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_sanskrit` is an English model originally trained by Tarakki100.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_sanskrit_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_sanskrit_en_4.2.0_3.0_1664112399942.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_sanskrit_en_4.2.0_3.0_1664112399942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_sanskrit', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_sanskrit", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_sanskrit|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|227.9 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Aspect based Sentiment Analysis for restaurant reviews
author: John Snow Labs
name: ner_aspect_based_sentiment
date: 2020-12-29
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 2.6.2
spark_version: 2.4
tags: [sentiment, open_source, en, ner]
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Automatically detect positive, negative and neutral aspects about restaurants from user reviews. Instead of labelling the entire review as negative or positive, this model helps identify which exact phrases relate to sentiment identified in the review.
## Predicted Entities
`NEG`, `POS`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/ASPECT_BASED_SENTIMENT_RESTAURANT/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/ABSA_Inference.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_aspect_based_sentiment_en_2.6.2_2.4_1609249232812.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_aspect_based_sentiment_en_2.6.2_2.4_1609249232812.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
word_embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", "xx")\
.setInputCols(["document", "token"])\
.setOutputCol("embeddings")
ner_model = NerDLModel.pretrained("ner_aspect_based_sentiment")\
.setInputCols(["document", "token", "embeddings"])\
.setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter])
model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["Came for lunch my sister. We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. But the service was below average and the chips were too terrible to finish."]]).toDF("text"))
```
```scala
...
val word_embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", "xx")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("ner_aspect_based_sentiment")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter))
val data = Seq("Came for lunch my sister. We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. But the service was below average and the chips were too terrible to finish.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Came for lunch my sister. We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. But the service was below average and the chips were too terrible to finish."""]
ner_df = nlu.load('en.ner.aspect_sentiment').predict(text, output_level='token')
list(zip(ner_df["entities"].values[0], ner_df["entities_confidence"].values[0]))
```
## Results
```bash
+----------------------------------------------------------------------------------------------------+-------------------+-----------+
| sentence | aspect | sentiment |
+----------------------------------------------------------------------------------------------------+-------------------+-----------+
| We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. | Thai-style main | positive |
| We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. | lots of flavours | positive |
| But the service was below average and the chips were too terrible to finish. | service | negative |
| But the service was below average and the chips were too terrible to finish. | chips | negative |
+----------------------------------------------------------------------------------------------------+-------------------+-----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_aspect_based_sentiment|
|Type:|ner|
|Compatibility:|Spark NLP 2.6.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, embeddings]|
|Output Labels:|[absa]|
|Language:|en|
|Dependencies:|glove_6B_300|
---
layout: model
title: Smaller BERT Sentence Embeddings (L-12_H-128_A-2)
author: John Snow Labs
name: sent_small_bert_L12_128
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L12_128_en_2.6.0_2.4_1598350359233.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L12_128_en_2.6.0_2.4_1598350359233.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L12_128", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L12_128", "en")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.small_bert_L12_128').predict(text, output_level='sentence')
embeddings_df
```
{:.h2_title}
## Results
```bash
en_embed_sentence_small_bert_L12_128_embeddings sentence
[-0.3747739791870117, -0.28460437059402466, 0.... I hate cancer
[0.9055836200714111, -0.41459062695503235, 0.0... Antibiotics aren't painkiller
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_small_bert_L12_128|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|en|
|Dimension:|128|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1
---
layout: model
title: Greek Lemmatizer
author: John Snow Labs
name: lemma
date: 2020-05-05 16:56:00 +0800
task: Lemmatization
language: el
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [lemmatizer, el]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous.
{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_el_2.5.0_2.4_1588686951720.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_el_2.5.0_2.4_1588686951720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
lemmatizer = LemmatizerModel.pretrained("lemma", "el") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Εκτός από το ότι είναι ο βασιλιάς του Βορρά, ο John Snow είναι Άγγλος γιατρός και ηγέτης στην ανάπτυξη της αναισθησίας και της ιατρικής υγιεινής.")
```
```scala
...
val lemmatizer = LemmatizerModel.pretrained("lemma", "el")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer))
val data = Seq("Εκτός από το ότι είναι ο βασιλιάς του Βορρά, ο John Snow είναι Άγγλος γιατρός και ηγέτης στην ανάπτυξη της αναισθησίας και της ιατρικής υγιεινής.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Εκτός από το ότι είναι ο βασιλιάς του Βορρά, ο John Snow είναι Άγγλος γιατρός και ηγέτης στην ανάπτυξη της αναισθησίας και της ιατρικής υγιεινής."""]
lemma_df = nlu.load('el.lemma').predict(text, output_level='document')
lemma_df.lemma.values[0]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=4, result='εκτός', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=6, end=8, result='από', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=10, end=11, result='ο', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=13, end=15, result='ότι', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=17, end=21, result='είμαι', metadata={'sentence': '0'}, embeddings=[]),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma|
|Type:|lemmatizer|
|Compatibility:|Spark NLP 2.5.0+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[lemma]|
|Language:|el|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: Legal Adjustments Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_adjustments_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, adjustments, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset is aimed at contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Adjustments` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
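As a plain-Python illustration of the first technique above, paragraph splitting by multiline boundaries can be sketched as follows (a minimal sketch, not the workshop's exact code):

```python
import re

def split_paragraphs(text):
    # Split a document into provisions on blank lines (one or more
    # whitespace-only lines), dropping empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = (
    "Section 1. Adjustments.\nThe price shall be adjusted.\n"
    "\n"
    "Section 2. Notices.\nAll notices shall be in writing."
)
paragraphs = split_paragraphs(doc)  # two provisions, classified one by one
```

Each resulting provision can then be fed to the classifier as an independent document.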
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Adjustments`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_adjustments_bert_en_1.0.0_3.0_1678050029175.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_adjustments_bert_en_1.0.0_3.0_1678050029175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
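This card's usage snippet is missing; below is a minimal sketch following the pattern of similar Legal NLP clause classifiers. The sentence-embeddings model name `sent_bert_base_cased` and the output column name are assumptions, and a licensed `johnsnowlabs` installation plus a running Spark session (`spark`) are required.

```python
from johnsnowlabs import nlp, legal

# Assemble the raw text into a document (no sentence splitting, per the note above)
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence-level BERT embeddings feeding the classifier (model name assumed)
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_adjustments_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])
df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```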
## Results
```bash
+-------------+
|       result|
+-------------+
|[Adjustments]|
|      [Other]|
|      [Other]|
|[Adjustments]|
+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_adjustments_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Adjustments 0.88 0.90 0.89 40
Other 0.93 0.91 0.92 58
accuracy - - 0.91 98
macro-avg 0.90 0.91 0.91 98
weighted-avg 0.91 0.91 0.91 98
```
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from sharonpeng)
author: John Snow Labs
name: distilbert_qa_sharonpeng_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `sharonpeng`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sharonpeng_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772605467.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sharonpeng_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772605467.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sharonpeng_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sharonpeng_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_sharonpeng_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/sharonpeng/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from google)
author: John Snow Labs
name: t5_efficient_base_ff1000
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-ff1000` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_ff1000_en_4.3.0_3.0_1675111729938.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_ff1000_en_4.3.0_3.0_1675111729938.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_base_ff1000","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_base_ff1000","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_base_ff1000|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|330.1 MB|
## References
- https://huggingface.co/google/t5-efficient-base-ff1000
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_4
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-64-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_4_en_4.0.0_3.0_1657185460089.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_4_en_4.0.0_3.0_1657185460089.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_4","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_4","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_4|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-64-finetuned-squad-seed-4
---
layout: model
title: Multilingual BertForQuestionAnswering model (from Paul-Vinh)
author: John Snow Labs
name: bert_qa_Paul_Vinh_bert_base_multilingual_cased_finetuned_squad
date: 2022-06-02
tags: [open_source, question_answering, bert]
task: Question Answering
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-squad` is a Multilingual model originally trained by `Paul-Vinh`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Paul_Vinh_bert_base_multilingual_cased_finetuned_squad_xx_4.0.0_3.0_1654180153562.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Paul_Vinh_bert_base_multilingual_cased_finetuned_squad_xx_4.0.0_3.0_1654180153562.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Paul_Vinh_bert_base_multilingual_cased_finetuned_squad","xx") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_Paul_Vinh_bert_base_multilingual_cased_finetuned_squad","xx")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("xx.answer_question.squad.bert.multilingual_base_cased.by_Paul-Vinh").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_Paul_Vinh_bert_base_multilingual_cased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|xx|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Paul-Vinh/bert-base-multilingual-cased-finetuned-squad
---
layout: model
title: Portuguese BertForMaskedLM Base Cased model (from Geotrend)
author: John Snow Labs
name: bert_embeddings_base_pt_cased
date: 2022-12-02
tags: [pt, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: pt
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-pt-cased` is a Portuguese model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_pt_cased_pt_4.2.4_3.0_1670018755967.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_pt_cased_pt_4.2.4_3.0_1670018755967.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_pt_cased","pt") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_pt_cased","pt")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_pt_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|pt|
|Size:|395.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/bert-base-pt-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Arabic Bert Embeddings (Large, Arabert Model, v02)
author: John Snow Labs
name: bert_embeddings_bert_large_arabertv02
date: 2022-04-11
tags: [bert, embeddings, ar, open_source]
task: Embeddings
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-arabertv02` is an Arabic model originally trained by `aubmindlab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_arabertv02_ar_3.4.2_3.0_1649677496517.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_arabertv02_ar_3.4.2_3.0_1649677496517.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_arabertv02","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_arabertv02","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("أنا أحب شرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.embed.bert_large_arabertv02").predict("""أنا أحب شرارة NLP""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_ar_cased","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_ar_cased","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("أنا أحب شرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.embed.distilbert").predict("""أنا أحب شرارة NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_embeddings_distilbert_base_ar_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|182.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/distilbert-base-ar-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Hindi XLMRobertaForTokenClassification Large Cased model (from cfilt)
author: John Snow Labs
name: xlmroberta_ner_hiner_original_large
date: 2022-08-13
tags: [hi, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: hi
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `HiNER-original-xlm-roberta-large` is a Hindi model originally trained by `cfilt`.
## Predicted Entities
`GAME`, `MISC`, `ORGANIZATION`, `FESTIVAL`, `LOCATION`, `LITERATURE`, `LANGUAGE`, `NUMEX`, `PERSON`, `RELIGION`, `TIMEX`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_hiner_original_large_hi_4.1.0_3.0_1660406213542.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_hiner_original_large_hi_4.1.0_3.0_1660406213542.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_hiner_original_large","hi") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_hiner_original_large","hi")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_hiner_original_large|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|hi|
|Size:|1.8 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/cfilt/HiNER-original-xlm-roberta-large
- https://paperswithcode.com/sota?task=Token+Classification&dataset=HiNER+Original
---
layout: model
title: Legal Bereavement leave Clause Binary Classifier
author: John Snow Labs
name: legclf_bereavement_leave_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `bereavement-leave` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `bereavement-leave`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_bereavement_leave_clause_en_1.0.0_3.2_1660123269138.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_bereavement_leave_clause_en_1.0.0_3.2_1660123269138.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
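A minimal Python sketch of a document-classification pipeline for this model, following the pattern used by sibling Legal NLP classifier cards. The sentence-embeddings stage shown here (`sent_bert_base_cased`) is an assumption and may differ from the embeddings this model was trained with; check the training details before relying on it.

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence embeddings feeding the classifier (assumed stage, not confirmed by this card)
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_bereavement_leave_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

data = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```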
## Results
```bash
+-------------------+
|             result|
+-------------------+
|[bereavement-leave]|
|            [other]|
|            [other]|
|[bereavement-leave]|
+-------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_bereavement_leave_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
bereavement-leave 1.00 1.00 1.00 30
other 1.00 1.00 1.00 76
accuracy - - 1.00 106
macro-avg 1.00 1.00 1.00 106
weighted-avg 1.00 1.00 1.00 106
```
---
layout: model
title: Medical Question Answering (biogpt)
author: John Snow Labs
name: biogpt_pubmed_qa
date: 2023-02-26
tags: [licensed, en, clinical, biogpt, gpt, pubmed, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Healthcare NLP 4.3.0
spark_version: 3.0
published: false
engine: tensorflow
annotator: MedicalQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model has been trained on medical documents and can generate two types of answers, short and long.
Two question types are supported: `"short"` (producing yes/no/maybe answers) and `"long"` (producing full answers).
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/biogpt_pubmed_qa_en_4.3.0_3.0_1677406773484.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/biogpt_pubmed_qa_en_4.3.0_3.0_1677406773484.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler()\
.setInputCols("question", "context")\
.setOutputCols("document_question", "document_context")
med_qa = MedicalQuestionAnswering.pretrained("medical_qa_biogpt","en","clinical/models")\
.setInputCols(["document_question", "document_context"])\
.setMaxNewTokens(100)\
.setOutputCol("answer")\
.setQuestionType("short") #long
pipeline = Pipeline(stages=[document_assembler, med_qa])
paper_abstract = "The visual indexing theory proposed by Zenon Pylyshyn (Cognition, 32, 65-97, 1989) predicts that visual attention mechanisms are employed when mental images are projected onto a visual scene. Recent eye-tracking studies have supported this hypothesis by showing that people tend to look at empty places where requested information has been previously presented. However, it has remained unclear to what extent this behavior is related to memory performance. The aim of the present study was to explore whether the manipulation of spatial attention can facilitate memory retrieval. In two experiments, participants were asked first to memorize a set of four objects and then to determine whether a probe word referred to any of the objects. The results of both experiments indicate that memory accuracy is not affected by the current focus of attention and that all the effects of directing attention to specific locations on response times can be explained in terms of stimulus-stimulus and stimulus-response spatial compatibility."
long_question = "What is the effect of directing attention on memory?"
yes_no_question = "Does directing attention improve memory for items?"
data = spark.createDataFrame(
[
[long_question, paper_abstract, "long"],
[yes_no_question, paper_abstract, "short"],
]
).toDF("question", "context", "question_type")
pipeline.fit(data).transform(data.where("question_type == 'long'"))\
.select("answer.result")\
.show(truncate=False)
pipeline.fit(data).transform(data.where("question_type == 'short'"))\
.select("answer.result")\
.show(truncate=False)
```
```scala
val document_assembler = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val med_qa = MedicalQuestionAnswering.pretrained("medical_qa_biogpt","en","clinical/models")
.setInputCols(Array("document_question", "document_context"))
.setMaxNewTokens(100)
.setOutputCol("answer")
.setQuestionType("short") // or "long"
val pipeline = new Pipeline().setStages(Array(document_assembler, med_qa))
val paper_abstract = "The visual indexing theory proposed by Zenon Pylyshyn (Cognition, 32, 65-97, 1989) predicts that visual attention mechanisms are employed when mental images are projected onto a visual scene. Recent eye-tracking studies have supported this hypothesis by showing that people tend to look at empty places where requested information has been previously presented. However, it has remained unclear to what extent this behavior is related to memory performance. The aim of the present study was to explore whether the manipulation of spatial attention can facilitate memory retrieval. In two experiments, participants were asked first to memorize a set of four objects and then to determine whether a probe word referred to any of the objects. The results of both experiments indicate that memory accuracy is not affected by the current focus of attention and that all the effects of directing attention to specific locations on response times can be explained in terms of stimulus-stimulus and stimulus-response spatial compatibility."
val long_question = "What is the effect of directing attention on memory?"
val yes_no_question = "Does directing attention improve memory for items?"
val data = Seq(
(long_question, paper_abstract,"long" ),
(yes_no_question, paper_abstract, "short"))
.toDS.toDF("question", "context", "question_type")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
#short result
+------+
|result|
+------+
|[no] |
+------+
#long result
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|[the results of the two experiments suggest that the visual indexeing theory does not fully explain the effects that spatial attention has on memory.]|
+------------------------------------------------------------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|biogpt_pubmed_qa|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.1 GB|
|Case sensitive:|true|
---
layout: model
title: Detect Clinical Entities (ner_jsl)
author: John Snow Labs
name: ner_jsl_en
date: 2020-04-22
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 2.4.2
spark_version: 2.4
tags: [ner, en, clinical, licensed]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for clinical terminology. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Definitions of Predicted Entities:
- `Age`: All mentions of age, past or present, related to the patient or anybody else.
- `Temperature`: All mentions that refer to body temperature.
- `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements.
- `Procedure`: All mentions of invasive medical or surgical procedures or treatments.
- `Symptom`: All the symptoms mentioned in the document, of a patient or someone else.
- `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs).
- `Route`: Drug and medication administration routes available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted.
- `Allergen`: Allergen related extractions mentioned in the document.
- `Gender`: Gender-specific nouns and pronouns.
- `Pulse`: Peripheral heart rate, without advanced information like measurement location.
- `Modifier`: Terms that modify symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the modifier is not extracted separately.
- `Frequency`: Frequency of administration for a dose prescribed.
- `Weight`: All mentions related to a patient's weight.
- `Dosage`: Quantity prescribed by the physician for an active ingredient; measurement units are available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Respiration`: Number of breaths per minute.
## Predicted Entities
`Diagnosis`, `Procedure_Name`, `Lab_Result`, `Procedure`, `Procedure_Findings`, `O2_Saturation`, `Procedure_incident_description`, `Dosage`, `Causative_Agents_(Virus_and_Bacteria)`, `Name`, `Cause_of_death`, `Substance_Name`, `Weight`, `Symptom_Name`, `Maybe`, `Modifier`, `Blood_Pressure`, `Frequency`, `Gender`, `Drug_incident_description`, `Age`, `Drug_Name`, `Temperature`, `Section_Name`, `Route`, `Negation`, `Negated`, `Allergenic_substance`, `Lab_Name`, `Respiratory_Rate`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_2.4.2_2.4_1587513304751.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_2.4.2_2.4_1587513304751.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter at the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %}
```python
...
embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."]], ["text"]))
```
```scala
...
val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = NerDLModel.pretrained("ner_jsl", "en", "clinical/models")
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))
val data = Seq("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_pos_chinese_roberta_base_upos","zh") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_chinese_roberta_base_upos","zh")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.pos.chinese_roberta_base_upos").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_pos_chinese_roberta_base_upos|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|zh|
|Size:|381.8 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/KoichiYasuoka/chinese-roberta-base-upos
- https://universaldependencies.org/u/pos/
- https://github.com/KoichiYasuoka/esupar
---
layout: model
title: Legal Labor matters Clause Binary Classifier
author: John Snow Labs
name: legclf_labor_matters_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `labor-matters` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `labor-matters`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_labor_matters_clause_en_1.0.0_3.2_1660123653618.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_labor_matters_clause_en_1.0.0_3.2_1660123653618.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
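A minimal Python sketch of a document-classification pipeline for this model, following the pattern used by sibling Legal NLP classifier cards. The sentence-embeddings stage shown here (`sent_bert_base_cased`) is an assumption and may differ from the embeddings this model was trained with; check the training details before relying on it.

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence embeddings feeding the classifier (assumed stage, not confirmed by this card)
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_labor_matters_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

data = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```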
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_base_irish_legal|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|gle|
|Size:|415.9 MB|
|Case sensitive:|true|
## References
https://huggingface.co/joelito/legal-irish-roberta-base
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from allenai)
author: John Snow Labs
name: t5_small_squad11
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-squad11` is an English model originally trained by `allenai`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_squad11_en_4.3.0_3.0_1675155640438.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_squad11_en_4.3.0_3.0_1675155640438.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_small_squad11","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_small_squad11","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_small_squad11|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|148.2 MB|
## References
- https://huggingface.co/allenai/t5-small-squad11
---
layout: model
title: English BertForQuestionAnswering model (from DaisyMak)
author: John Snow Labs
name: bert_qa_bert_finetuned_squad_accelerate_10epoch_transformerfrozen
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-accelerate-10epoch_transformerfrozen` is an English model originally trained by `DaisyMak`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_squad_accelerate_10epoch_transformerfrozen_en_4.0.0_3.0_1654535929048.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_squad_accelerate_10epoch_transformerfrozen_en_4.0.0_3.0_1654535929048.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_finetuned_squad_accelerate_10epoch_transformerfrozen","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_finetuned_squad_accelerate_10epoch_transformerfrozen","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.by_DaisyMak").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_finetuned_squad_accelerate_10epoch_transformerfrozen|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/DaisyMak/bert-finetuned-squad-accelerate-10epoch_transformerfrozen
---
layout: model
title: Spam Classifier
author: John Snow Labs
name: classifierdl_use_spam
class: ClassifierDLModel
language: en
nav_key: models
repository: public/models
date: 03/07/2020
task: Text Classification
edition: Spark NLP 2.5.3
spark_version: 2.4
tags: [classifier]
supported: true
annotator: ClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Automatically identify messages as regular messages (ham) or spam.
{:.h2_title}
## Predicted Entities
`spam`, `ham`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_EN_SPAM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_EN_SPAM.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_spam_en_2.5.3_2.4_1593783318934.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_spam_en_2.5.3_2.4_1593783318934.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
use = UniversalSentenceEncoder.pretrained(lang="en") \
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
document_classifier = ClassifierDLModel.pretrained('classifierdl_use_spam', 'en') \
.setInputCols(["document", "sentence_embeddings"]) \
.setOutputCol("class")
nlpPipeline = Pipeline(stages=[documentAssembler, use, document_classifier])
light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate("Congratulations! You've won a $1,000 Walmart gift card. Go to http://bit.ly/1234 to claim now.")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val use = UniversalSentenceEncoder.pretrained(lang="en")
.setInputCols(Array("document"))
.setOutputCol("sentence_embeddings")
val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_spam", "en")
.setInputCols(Array("document", "sentence_embeddings"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier))
val data = Seq("Congratulations! You've won a $1,000 Walmart gift card. Go to http://bit.ly/1234 to claim now.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Congratulations! You've won a $1,000 Walmart gift card. Go to http://bit.ly/1234 to claim now."""]
spam_df = nlu.load('classify.spam.use').predict(text, output_level='document')
spam_df[["document", "spam"]]
```
{:.h2_title}
## Results
```bash
+------------------------------------------------------------------------------------------------+------------+
|document |class |
+------------------------------------------------------------------------------------------------+------------+
|Congratulations! You've won a $1,000 Walmart gift card. Go to http://bit.ly/1234 to claim now. | spam |
+------------------------------------------------------------------------------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
| Model Name | classifierdl_use_spam |
| Model Class | ClassifierDLModel |
| Spark Compatibility | 2.5.3 |
| Spark NLP Compatibility | 2.4 |
| License | open source |
| Edition | public |
| Input Labels | [document, sentence_embeddings] |
| Output Labels | [class] |
| Language | en |
| Upstream Dependencies | tfhub_use |
{:.h2_title}
## Data Source
This model is trained on UCI spam dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
{:.h2_title}
## Benchmarking
Accuracy of the model with USE Embeddings is `0.86`
```bash
precision recall f1-score support
ham 0.86 1.00 0.92 1440
spam 0.00 0.00 0.00 238
accuracy 0.86 1678
macro avg 0.43 0.50 0.46 1678
weighted avg 0.74 0.86 0.79 1678
```
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah TFWav2Vec2ForCTC from nimrah
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah` is an English model originally trained by nimrah.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah_en_4.2.0_3.0_1664115813044.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah_en_4.2.0_3.0_1664115813044.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: German ElectraForQuestionAnswering Distilled model (from deepset)
author: John Snow Labs
name: electra_qa_g_base_germanquad_distilled
date: 2022-06-22
tags: [de, open_source, electra, question_answering]
task: Question Answering
language: de
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gelectra-base-germanquad-distilled` is a German model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_g_base_germanquad_distilled_de_4.0.0_3.0_1655921806755.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_g_base_germanquad_distilled_de_4.0.0_3.0_1655921806755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_g_base_germanquad_distilled","de") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["Was ist mein Name?", "Mein Name ist Clara und ich lebe in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_g_base_germanquad_distilled","de")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("Was ist mein Name?", "Mein Name ist Clara und ich lebe in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.answer_question.electra.distilled_base").predict("""Was ist mein Name?|||Mein Name ist Clara und ich lebe in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_g_base_germanquad_distilled|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|de|
|Size:|410.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/deepset/gelectra-base-germanquad-distilled
- https://deepset.ai/germanquad
- https://deepset.ai/german-bert
- https://github.com/deepset-ai/FARM
- https://github.com/deepset-ai/haystack/
- https://haystack.deepset.ai/community/join
---
layout: model
title: Part of Speech for Armenian
author: John Snow Labs
name: pos_ud_armtdp
date: 2020-07-29 23:34:00 +0800
task: Part of Speech Tagging
language: hy
edition: Spark NLP 2.5.5
spark_version: 2.4
tags: [pos, hy]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_armtdp_hy_2.5.5_2.4_1596053517801.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_armtdp_hy_2.5.5_2.4_1596053517801.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
pos = PerceptronModel.pretrained("pos_ud_armtdp", "hy") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Հյուսիսային թագավոր լինելուց բացի, Johnոն Սնոուն անգլիացի բժիշկ է և անզգայացման և բժշկական հիգիենայի զարգացման առաջատար:")
```
```scala
...
val pos = PerceptronModel.pretrained("pos_ud_armtdp", "hy")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("Հյուսիսային թագավոր լինելուց բացի, Johnոն Սնոուն անգլիացի բժիշկ է և անզգայացման և բժշկական հիգիենայի զարգացման առաջատար:").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Հյուսիսային թագավոր լինելուց բացի, Johnոն Սնոուն անգլիացի բժիշկ է և անզգայացման և բժշկական հիգիենայի զարգացման առաջատար:"""]
pos_df = nlu.load('hy.pos').predict(text, output_level='token')
pos_df
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='pos', begin=0, end=10, result='ADJ', metadata={'word': 'Հյուսիսային'}),
Row(annotatorType='pos', begin=12, end=18, result='ADJ', metadata={'word': 'թագավոր'}),
Row(annotatorType='pos', begin=20, end=27, result='NOUN', metadata={'word': 'լինելուց'}),
Row(annotatorType='pos', begin=29, end=32, result='ADP', metadata={'word': 'բացի'}),
Row(annotatorType='pos', begin=33, end=33, result='PUNCT', metadata={'word': ','}),
...]
```
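As the rows above suggest, the `begin`/`end` offsets are inclusive character indices into the input text, so a token can be recovered with the slice `text[begin:end + 1]`. A small plain-Python illustration using the first few annotations (no Spark required):

```python
# First words of the example sentence from above
text = "Հյուսիսային թագավոր լինելուց բացի,"

# (begin, end, tag) triples taken from the result rows above
annotations = [(0, 10, "ADJ"), (12, 18, "ADJ"), (20, 27, "NOUN"),
               (29, 32, "ADP"), (33, 33, "PUNCT")]

for begin, end, tag in annotations:
    token = text[begin:end + 1]  # end offset is inclusive
    print(f"{token}\t{tag}")
```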
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_armtdp|
|Type:|pos|
|Compatibility:|Spark NLP 2.5.5+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[pos]|
|Language:|hy|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: Legal Confidential Treatment Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_confidential_treatment_bert
date: 2023-01-26
tags: [en, legal, classification, confidential, treatment, licensed, bert, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_confidential_treatment_bert` model is a Bert Sentence Embeddings document classifier used to classify whether a document belongs to the class `confidential-treatment` or not (binary classification).
Unlike the Longformer model, this model is lighter and faster at inference.
## Predicted Entities
`confidential-treatment`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_confidential_treatment_bert_en_1.0.0_3.0_1674732030348.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_confidential_treatment_bert_en_1.0.0_3.0_1674732030348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
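The usage snippet is missing from this card; the sketch below follows the pattern used by other Legal NLP classifier cards and is an assumption-laden example, not the verified setup. It assumes the licensed `johnsnowlabs` library with a running Spark session, and uses `sent_bert_base_cased` as a stand-in for the sentence embeddings the classifier expects (the exact embeddings are not stated on this card):

```python
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Hypothetical embeddings stage: the pretrained embeddings used for training are not stated here
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_confidential_treatment_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```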
## Results
```bash
+------------------------+
|result                  |
+------------------------+
|[confidential-treatment]|
|[other]                 |
|[other]                 |
|[confidential-treatment]|
+------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_confidential_treatment_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents, scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
confidential-treatment 0.98 0.98 0.98 55
other 0.99 0.99 0.99 116
accuracy - - 0.99 171
macro-avg 0.99 0.99 0.99 171
weighted-avg 0.99 0.99 0.99 171
```
---
layout: model
title: English image_classifier_vit_rock_challenge_ViT_two_by_two ViTForImageClassification from dimbyTa
author: John Snow Labs
name: image_classifier_vit_rock_challenge_ViT_two_by_two
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rock_challenge_ViT_two_by_two` is an English model originally trained by dimbyTa.
## Predicted Entities
`fines`, `large`, `medium`, `pellets`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rock_challenge_ViT_two_by_two_en_4.1.0_3.0_1660168489231.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rock_challenge_ViT_two_by_two_en_4.1.0_3.0_1660168489231.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_rock_challenge_ViT_two_by_two", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_rock_challenge_ViT_two_by_two", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_rock_challenge_ViT_two_by_two|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Part of Speech for Marathi
author: John Snow Labs
name: pos_ud_ufal
date: 2020-07-29 23:34:00 +0800
task: Part of Speech Tagging
language: mr
edition: Spark NLP 2.5.5
spark_version: 2.4
tags: [pos, mr]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_ufal_mr_2.5.5_2.4_1596054314811.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_ufal_mr_2.5.5_2.4_1596054314811.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
pos = PerceptronModel.pretrained("pos_ud_ufal", "mr") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("उत्तरेचा राजा होण्याव्यतिरिक्त, जॉन स्नो एक इंग्रज चिकित्सक आहे आणि भूल आणि वैद्यकीय स्वच्छतेच्या विकासासाठी अग्रगण्य आहे.")
```
```scala
...
val pos = PerceptronModel.pretrained("pos_ud_ufal", "mr")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("उत्तरेचा राजा होण्याव्यतिरिक्त, जॉन स्नो एक इंग्रज चिकित्सक आहे आणि भूल आणि वैद्यकीय स्वच्छतेच्या विकासासाठी अग्रगण्य आहे.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""उत्तरेचा राजा होण्याव्यतिरिक्त, जॉन स्नो एक इंग्रज चिकित्सक आहे आणि भूल आणि वैद्यकीय स्वच्छतेच्या विकासासाठी अग्रगण्य आहे."""]
pos_df = nlu.load('mr.pos').predict(text)
pos_df
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='pos', begin=0, end=7, result='NOUN', metadata={'word': 'उत्तरेचा'}),
Row(annotatorType='pos', begin=9, end=12, result='NOUN', metadata={'word': 'राजा'}),
Row(annotatorType='pos', begin=14, end=29, result='NOUN', metadata={'word': 'होण्याव्यतिरिक्त'}),
Row(annotatorType='pos', begin=30, end=30, result='PUNCT', metadata={'word': ','}),
Row(annotatorType='pos', begin=32, end=34, result='NOUN', metadata={'word': 'जॉन'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_ufal|
|Type:|pos|
|Compatibility:|Spark NLP 2.5.5+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[pos]|
|Language:|mr|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: Fast Neural Machine Translation Model from English to Pangasinan
author: John Snow Labs
name: opus_mt_en_pag
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, pag, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `pag`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_pag_xx_2.7.0_2.4_1609169951806.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_pag_xx_2.7.0_2.4_1609169951806.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_pag", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = "Your sentence to translate here."
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_pag", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate here.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.pag').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_pag|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from jgammack)
author: John Snow Labs
name: distilbert_qa_sae_base_uncased_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `SAE-distilbert-base-uncased-squad` is an English model originally trained by `jgammack`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sae_base_uncased_squad_en_4.3.0_3.0_1672765542477.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sae_base_uncased_squad_en_4.3.0_3.0_1672765542477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sae_base_uncased_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sae_base_uncased_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_sae_base_uncased_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/jgammack/SAE-distilbert-base-uncased-squad
---
layout: model
title: English image_classifier_vit_amgerindaf ViTForImageClassification from gaganpathre
author: John Snow Labs
name: image_classifier_vit_amgerindaf
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_amgerindaf` is an English model originally trained by gaganpathre.
## Predicted Entities
`african`, `american`, `german`, `indian`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_amgerindaf_en_4.1.0_3.0_1660172489317.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_amgerindaf_en_4.1.0_3.0_1660172489317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_amgerindaf", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_amgerindaf", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_amgerindaf|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: XLNet Large CoNLL-03 NER Pipeline
author: John Snow Labs
name: xlnet_large_token_classifier_conll03_pipeline
date: 2022-06-19
tags: [open_source, ner, token_classifier, xlnet, conll03, large, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [xlnet_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/28/xlnet_large_token_classifier_conll03_en.html) model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlnet_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655654301280.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlnet_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655654301280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("xlnet_large_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")
```
```scala
val pipeline = new PretrainedPipeline("xlnet_large_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|John |PERSON |
|John Snow Labs|ORG |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlnet_large_token_classifier_conll03_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.4 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- XlnetForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: English DistilBertForTokenClassification Cased model (from ml6team)
author: John Snow Labs
name: distilbert_token_classifier_keyphrase_extraction_kptimes
date: 2023-03-03
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-kptimes` is an English model originally trained by `ml6team`.
## Predicted Entities
`KEY`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_kptimes_en_4.3.1_3.0_1677881468043.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_kptimes_en_4.3.1_3.0_1677881468043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_kptimes","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_kptimes","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_keyphrase_extraction_kptimes|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ml6team/keyphrase-extraction-distilbert-kptimes
- https://arxiv.org/abs/1911.12559
- https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=kptimes
---
layout: model
title: Chinese T5ForConditionalGeneration Cased model (from wawaup)
author: John Snow Labs
name: t5_mengzit5_comment
date: 2023-01-30
tags: [zh, open_source, t5]
task: Text Generation
language: zh
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `MengziT5-Comment` is a Chinese model originally trained by `wawaup`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_mengzit5_comment_zh_4.3.0_3.0_1675098308627.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_mengzit5_comment_zh_4.3.0_3.0_1675098308627.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_mengzit5_comment","zh") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_mengzit5_comment","zh")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_mengzit5_comment|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|zh|
|Size:|1.0 GB|
## References
- https://huggingface.co/wawaup/MengziT5-Comment
- https://github.com/lancopku/Graph-to-seq-comment-generation
---
layout: model
title: Legal NER for MAPA(Multilingual Anonymisation for Public Administrations)
author: John Snow Labs
name: legner_mapa
date: 2023-04-27
tags: [el, ner, legal, mapa, licensed]
task: Named Entity Recognition
language: el
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union.
This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Greek` documents.
## Predicted Entities
`ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_el_1.0.0_3.0_1682590655353.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_el_1.0.0_3.0_1682590655353.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_el_cased", "el")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)\
.setCaseSensitive(True)
ner_model = legal.NerModel.pretrained("legner_mapa", "el", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""86 Στην υπόθεση της κύριας δίκης, προκύπτει ότι ορισμένοι εργαζόμενοι της Martin‑Meat αποσπάσθηκαν στην Αυστρία κατά την περίοδο μεταξύ του έτους 2007 και του έτους 2012, για την εκτέλεση εργασιών τεμαχισμού κρέατος σε εγκαταστάσεις της Alpenrind."""]
result = model.transform(spark.createDataFrame([text]).toDF("text"))
```
## Results
```bash
+-----------+------------+
|chunk |ner_label |
+-----------+------------+
|Martin‑Meat|ORGANISATION|
|Αυστρία |ADDRESS |
|2007 |DATE |
|2012 |DATE |
|Alpenrind |ORGANISATION|
+-----------+------------+
```
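The `ner_chunk` column produced by `nlp.NerConverter()` above comes from standard BIO-tag merging; as a framework-free illustration of that grouping logic (hypothetical tokens and tags, not this model's actual output):

```python
# Merge BIO-tagged tokens into (chunk, label) pairs, mirroring
# what NerConverter does to build the ner_chunk column.
def merge_bio(tokens, tags):
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" or an inconsistent I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Hypothetical example in the spirit of the results table above
tokens = ["Martin‑Meat", "operated", "in", "Αυστρία", "between", "2007", "and", "2012"]
tags = ["B-ORGANISATION", "O", "O", "B-ADDRESS", "O", "B-DATE", "O", "B-DATE"]
print(merge_bio(tokens, tags))
# [('Martin‑Meat', 'ORGANISATION'), ('Αυστρία', 'ADDRESS'), ('2007', 'DATE'), ('2012', 'DATE')]
```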
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_mapa|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|el|
|Size:|16.4 MB|
## References
The dataset is available [here](https://huggingface.co/datasets/joelito/mapa).
## Benchmarking
```bash
label precision recall f1-score support
ADDRESS 0.89 1.00 0.94 16
AMOUNT 0.82 0.75 0.78 12
DATE 0.98 0.98 0.98 65
ORGANISATION 0.85 0.85 0.85 40
PERSON 0.90 0.95 0.92 38
macro-avg 0.91 0.93 0.92 171
macro-avg 0.89 0.91 0.90 171
weighted-avg 0.91 0.93 0.92 171
```
---
layout: model
title: English DistilBertForTokenClassification Cased model (from ml6team)
author: John Snow Labs
name: distilbert_token_classifier_keyphrase_extraction_kptimes
date: 2023-03-06
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-kptimes` is an English model originally trained by `ml6team`.
## Predicted Entities
`KEY`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_kptimes_en_4.3.1_3.0_1678133866327.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_kptimes_en_4.3.1_3.0_1678133866327.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_kptimes","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_kptimes","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_keyphrase_extraction_kptimes|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ml6team/keyphrase-extraction-distilbert-kptimes
- https://arxiv.org/abs/1911.12559
- https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=kptimes
---
layout: model
title: Ganda asr_wav2vec2_luganda_by_cahya TFWav2Vec2ForCTC from cahya
author: John Snow Labs
name: pipeline_asr_wav2vec2_luganda_by_cahya
date: 2022-09-24
tags: [wav2vec2, lg, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: lg
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_luganda_by_cahya` is a Ganda model originally trained by cahya.
NOTE: This pipeline only works on a CPU; if you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec2_luganda_by_cahya_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_luganda_by_cahya_lg_4.2.0_3.0_1664037813297.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_luganda_by_cahya_lg_4.2.0_3.0_1664037813297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_luganda_by_cahya', lang = 'lg')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_luganda_by_cahya", lang = "lg")
val annotations = pipeline.transform(audioDF)
```
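Both snippets assume an `audioDF` whose `audio_content` column holds arrays of floats sampled at 16 kHz. A minimal, standard-library-only sketch of producing such floats from a 16-bit PCM mono WAV file (libraries like `librosa` or `soundfile` are the more common choice; the file path and DataFrame construction here are illustrative assumptions):

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV and return samples scaled to [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1
        raw = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("speech.wav")  # hypothetical path
# audioDF = spark.createDataFrame([[floats]], ["audio_content"])
```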
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_luganda_by_cahya|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|lg|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: NER Model Finder
author: John Snow Labs
name: ner_model_finder
date: 2022-09-05
tags: [pretrainedpipeline, clinical, ner, en, licensed]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 4.1.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is trained with BERT embeddings and can be used to find the most appropriate NER model for a given entity name.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_MODEL_FINDER/){:.button.button-orange}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_model_finder_en_4.1.0_3.0_1662378666469.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_model_finder_en_4.1.0_3.0_1662378666469.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
ner_pipeline = PretrainedPipeline("ner_model_finder", "en", "clinical/models")
result = ner_pipeline.annotate("medication")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val ner_pipeline = PretrainedPipeline("ner_model_finder","en","clinical/models")
val result = ner_pipeline.annotate("medication")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.ner.model_finder.pipeline").predict("""Put your text here.""")
```
## Results
```bash
{'model_names': ["['ner_posology_greedy', 'jsl_ner_wip_modifier_clinical', 'ner_posology_small', 'ner_jsl_greedy', 'ner_ade_clinical', 'ner_posology', 'ner_risk_factors', 'ner_ade_healthcare', 'ner_drugs_large', 'ner_jsl_slim', 'ner_posology_experimental', 'ner_posology_large', 'ner_posology_healthcare', 'ner_drugs_greedy', 'ner_pathogen']"]}
```
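Note that the `model_names` value comes back as a single stringified list; assuming the format shown above, `ast.literal_eval` recovers a proper Python list:

```python
import ast

# Shape of the pipeline output shown above (the inner value is one string;
# the model names here are a shortened, illustrative subset)
result = {"model_names": ["['ner_posology_greedy', 'ner_drugs_large', 'ner_pathogen']"]}

models = ast.literal_eval(result["model_names"][0])
print(models[0])    # ner_posology_greedy
print(len(models))  # 3
```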
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_model_finder|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.1.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|155.9 MB|
## Included Models
- DocumentAssembler
- BertSentenceEmbeddings
- SentenceEntityResolverModel
- Finisher
---
layout: model
title: Korean asr_wav2vec2_large_xlsr_korean TFWav2Vec2ForCTC from kresnik
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_korean
date: 2022-09-25
tags: [wav2vec2, ko, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: ko
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_korean` is a Korean model originally trained by kresnik.
NOTE: This pipeline only works on a CPU; if you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_korean_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_korean_ko_4.2.0_3.0_1664112533768.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_korean_ko_4.2.0_3.0_1664112533768.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_korean', lang = 'ko')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_korean", lang = "ko")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_korean|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|ko|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Chinese Bert Embeddings (Base, captions dataset)
author: John Snow Labs
name: bert_embeddings_mengzi_oscar_base_caption
date: 2022-04-11
tags: [bert, embeddings, zh, open_source]
task: Embeddings
language: zh
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `mengzi-oscar-base-caption` is a Chinese model originally trained by `Langboat`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_mengzi_oscar_base_caption_zh_3.4.2_3.0_1649670622527.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_mengzi_oscar_base_caption_zh_3.4.2_3.0_1649670622527.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_mengzi_oscar_base_caption","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_mengzi_oscar_base_caption","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.embed.mengzi_oscar_base_caption").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_mengzi_oscar_base_caption|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|383.6 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Langboat/mengzi-oscar-base-caption
- https://arxiv.org/abs/2110.06696
- https://github.com/Langboat/Mengzi/blob/main/Mengzi-Oscar.md
- https://github.com/microsoft/Oscar/blob/master/INSTALL.md
- https://github.com/Langboat/Mengzi/blob/main/Mengzi-Oscar.md
---
layout: model
title: Catalan RobertaForQuestionAnswering Base Cased model (from projecte-aina)
author: John Snow Labs
name: roberta_qa_base_ca_cased
date: 2022-12-02
tags: [ca, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: ca
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-ca-cased-qa` is a Catalan model originally trained by `projecte-aina`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_ca_cased_ca_4.2.4_3.0_1669986048039.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_ca_cased_ca_4.2.4_3.0_1669986048039.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_ca_cased","ca")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_ca_cased","ca")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
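Under the hood, extractive QA models like this one score every token as a potential answer start and end, and the returned answer is the span that maximizes the combined score. A toy, framework-free sketch of that selection step (invented scores, not actual model logits):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (start, end) maximizing start+end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score, best = s + end_scores[j], (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 3.0, 0.0, 0.1, 0.0, 0.0, 0.5, 0.0]
end = [0.0, 0.1, 0.0, 2.5, 0.2, 0.0, 0.0, 0.0, 0.9, 0.1]
i, j = best_span(start, end)
print(tokens[i:j + 1])  # ['Clara']
```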
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_ca_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|ca|
|Size:|451.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/projecte-aina/roberta-base-ca-cased-qa
- https://arxiv.org/abs/1907.11692
- https://github.com/projecte-aina/club
- https://www.apache.org/licenses/LICENSE-2.0
- https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca%7Cen
- https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina
---
layout: model
title: German asr_wav2vec2_large_xlsr_german_demo TFWav2Vec2ForCTC from marcel
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_german_demo
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_german_demo` is a German model originally trained by marcel.
NOTE: This model only works on a CPU; if you need to run this model on a GPU device, please use asr_wav2vec2_large_xlsr_german_demo_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_german_demo_de_4.2.0_3.0_1664103787136.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_german_demo_de_4.2.0_3.0_1664103787136.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xlsr_german_demo", "de")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xlsr_german_demo", "de")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xlsr_german_demo|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|de|
|Size:|1.2 GB|
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from evegarcianz)
author: John Snow Labs
name: distilbert_qa_finetuned_adversarial
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-adversarial_qa` is an English model originally trained by `evegarcianz`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_adversarial_en_4.3.0_3.0_1672765744855.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_adversarial_en_4.3.0_3.0_1672765744855.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned_adversarial","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned_adversarial","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_finetuned_adversarial|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/evegarcianz/bert-finetuned-adversarial_qa
---
layout: model
title: Legal Portuguese Embeddings (Base, Agreements)
author: John Snow Labs
name: bert_embeddings_bert_base_portuguese_cased_finetuned_tcu_acordaos
date: 2022-04-11
tags: [bert, embeddings, pt, open_source]
task: Embeddings
language: pt
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-portuguese-cased-finetuned-tcu-acordaos` is a Portuguese model originally trained by `Luciano`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_portuguese_cased_finetuned_tcu_acordaos_pt_3.4.2_3.0_1649674108376.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_portuguese_cased_finetuned_tcu_acordaos_pt_3.4.2_3.0_1649674108376.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_portuguese_cased_finetuned_tcu_acordaos","pt") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_portuguese_cased_finetuned_tcu_acordaos","pt")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Eu amo Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("pt.embed.bert_base_portuguese_cased_finetuned_tcu_acordaos").predict("""Eu amo Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_portuguese_cased_finetuned_tcu_acordaos|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|pt|
|Size:|408.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Luciano/bert-base-portuguese-cased-finetuned-tcu-acordaos
---
layout: model
title: Fast Neural Machine Translation Model from English to French-Based Creoles And Pidgins
author: John Snow Labs
name: opus_mt_en_cpf
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, cpf, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `cpf`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_cpf_xx_2.7.0_2.4_1609169519834.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_cpf_xx_2.7.0_2.4_1609169519834.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_cpf", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your text to translate goes here.")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_cpf", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your text to translate goes here.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.cpf').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_cpf|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Smaller BERT Embeddings (L-10_H-512_A-8)
author: John Snow Labs
name: small_bert_L10_512
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L10_512_en_2.6.0_2.4_1598344780916.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L10_512_en_2.6.0_2.4_1598344780916.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("small_bert_L10_512", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("small_bert_L10_512", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.bert.small_L10_512').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_bert_small_L10_512_embeddings
I [0.08983156085014343, 0.6781706809997559, -0.1...
love [-0.22787825763225555, 0.15800981223583221, 1....
NLP [0.2888692617416382, 0.49437081813812256, -0.4...
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|small_bert_L10_512|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|512|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_12_h_256
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-12_H-256` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_256_zh_4.2.4_3.0_1670325800870.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_256_zh_4.2.4_3.0_1670325800870.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_256","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_256","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_12_h_256|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|57.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-12_H-256
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://github.com/dbiir/UER-py/
- https://cloud.tencent.com/
---
layout: model
title: Legal Capitalization Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_capitalization_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, capitalization, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Capitalization` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
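As a rough illustration of the first option, paragraph splitting by multiline can be done in plain Python before feeding documents to the pipeline. This is a simplified sketch, not the Legal NLP splitting annotators covered in the tutorial linked above:

```python
import re

def split_paragraphs(text: str) -> list:
    # Split on one or more blank lines and drop empty fragments.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = "Capitalization. The authorized capital stock...\n\nGoverning Law. This Agreement shall..."
paragraphs = split_paragraphs(doc)
```

Each resulting paragraph can then be classified independently, keeping every input under the model's token limit.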
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Capitalization`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_capitalization_bert_en_1.0.0_3.0_1678050537123.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_capitalization_bert_en_1.0.0_3.0_1678050537123.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
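The code block for this card is missing; below is a minimal Python sketch modeled on sibling Legal NLP classifier cards. The sentence-embedding model name (`sent_bert_base_cased`) is an assumption based on comparable cards and may need adjusting:

```python
# Sketch based on sibling legclf_* model cards; the embedding model is an assumption.
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_capitalization_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```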
## Results
```bash
+----------------+
|result          |
+----------------+
|[Capitalization]|
|[Other]         |
|[Other]         |
|[Capitalization]|
+----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_capitalization_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Capitalization 0.92 0.96 0.94 48
Other 0.97 0.94 0.96 70
accuracy - - 0.95 118
macro-avg 0.95 0.95 0.95 118
weighted-avg 0.95 0.95 0.95 118
```
---
layout: model
title: Part of Speech for Swedish
author: John Snow Labs
name: pos_ud_tal
date: 2020-05-04 23:32:00 +0800
task: Part of Speech Tagging
language: sv
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [pos, sv]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_tal_sv_2.5.0_2.4_1588622711284.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_tal_sv_2.5.0_2.4_1588622711284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
pos = PerceptronModel.pretrained("pos_ud_tal", "sv") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Förutom att vara kungen i norr är John Snow en engelsk läkare och en ledare inom utveckling av anestesi och medicinsk hygien.")
```
```scala
...
val pos = PerceptronModel.pretrained("pos_ud_tal", "sv")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("Förutom att vara kungen i norr är John Snow en engelsk läkare och en ledare inom utveckling av anestesi och medicinsk hygien.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Förutom att vara kungen i norr är John Snow en engelsk läkare och en ledare inom utveckling av anestesi och medicinsk hygien."""]
pos_df = nlu.load('sv.pos.ud_tal').predict(text, output_level='token')
pos_df
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='pos', begin=0, end=6, result='ADP', metadata={'word': 'Förutom'}),
Row(annotatorType='pos', begin=8, end=10, result='PART', metadata={'word': 'att'}),
Row(annotatorType='pos', begin=12, end=15, result='AUX', metadata={'word': 'vara'}),
Row(annotatorType='pos', begin=17, end=22, result='NOUN', metadata={'word': 'kungen'}),
Row(annotatorType='pos', begin=24, end=24, result='ADP', metadata={'word': 'i'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_tal|
|Type:|pos|
|Compatibility:|Spark NLP 2.5.0+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[pos]|
|Language:|sv|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: Pipeline to Detect PHI for Deidentification (Generic - Augmented)
author: John Snow Labs
name: ner_deid_generic_augmented_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, deidentification, generic, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_deid_generic_augmented](https://nlp.johnsnowlabs.com/2021/06/30/ner_deid_generic_augmented_en.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_pipeline_en_3.4.1_3.0_1647869128382.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_pipeline_en_3.4.1_3.0_1647869128382.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_deid_generic_augmented_pipeline", "en", "clinical/models")
pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227.")
```
```scala
val pipeline = new PretrainedPipeline("ner_deid_generic_augmented_pipeline", "en", "clinical/models")
pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.deid_generic_augmented.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227.""")
```
## Results
```bash
+-------------------------------------------------+---------+
|chunk |ner_label|
+-------------------------------------------------+---------+
|2093-01-13 |DATE |
|David Hale |NAME |
|Hendrickson |NAME |
|Ora MR. |LOCATION |
|7194334 |ID |
|01/13/93 |DATE |
|Oliveira |NAME |
|25 |AGE |
|1-11-2000 |DATE |
|Cocke County Baptist Hospital. 0295 Keats Street.|LOCATION |
|(302) 786-5227 |CONTACT |
+-------------------------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_generic_augmented_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: Detect Drug Information (Large)
author: John Snow Labs
name: ner_posology_large
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for posology. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
## Predicted Entities
`DOSAGE`, `DRUG`, `DURATION`, `FORM`, `FREQUENCY`, `ROUTE`, `STRENGTH`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_large_en_3.0.0_3.0_1617207221150.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_large_en_3.0.0_3.0_1617207221150.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_posology_large", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("entities")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])
model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([['The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.']], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_posology_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("entities")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))
val data = Seq("The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.posology.large").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_1b_1_finetuned_squadv1","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_1b_1_finetuned_squadv1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_1b_1_finetuned_squadv1|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|446.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mrm8488/roberta-base-1B-1-finetuned-squadv1
- https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/
- https://twitter.com/mrm8488
- https://www.linkedin.com/in/manuel-romero-cs/
---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_10
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_10_en_4.0.0_3.0_1655730685006.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_10_en_4.0.0_3.0_1655730685006.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_10","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_10","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_1024d_seed_10").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_10|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|439.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-10
---
layout: model
title: Legal Application Of Trust Money Clause Binary Classifier
author: John Snow Labs
name: legclf_application_of_trust_money_clause
date: 2023-01-27
tags: [en, legal, classification, application, trust, money, clauses, application_of_trust_money, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `application-of-trust-money` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`application-of-trust-money`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_application_of_trust_money_clause_en_1.0.0_3.0_1674820460698.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_application_of_trust_money_clause_en_1.0.0_3.0_1674820460698.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
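The code block for this card is also missing; a minimal Python sketch follows, mirroring sibling Legal NLP classifier cards. The sentence-embedding model name (`sent_bert_base_cased`) is an assumption and may need adjusting:

```python
# Sketch based on sibling legclf_* model cards; the embedding model is an assumption.
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_application_of_trust_money_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```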
## Results
```bash
+----------------------------+
|result                      |
+----------------------------+
|[application-of-trust-money]|
|[other]                     |
|[other]                     |
|[application-of-trust-money]|
+----------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_application_of_trust_money_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
application-of-trust-money 1.00 0.89 0.94 18
other 0.95 1.00 0.97 36
accuracy - - 0.96 54
macro-avg 0.97 0.94 0.96 54
weighted-avg 0.96 0.96 0.96 54
```
---
layout: model
title: Word2Vec Embeddings in Breton (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-14
tags: [cc, embeddings, fastText, word2vec, br, open_source]
task: Embeddings
language: br
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_br_3.4.1_3.0_1647287953369.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_br_3.4.1_3.0_1647287953369.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","br") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","br")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("br.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|br|
|Size:|351.5 MB|
|Case sensitive:|false|
|Dimension:|300|
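The 300-dimensional vectors in the `embeddings` output column can be compared with ordinary vector arithmetic, e.g. cosine similarity. A minimal pure-Python sketch; the 3-d toy vectors below are illustrative stand-ins for the real 300-d outputs:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for the per-token vectors the model emits.
v_cat = [0.8, 0.1, 0.3]
v_dog = [0.7, 0.2, 0.4]
v_car = [0.1, 0.9, 0.0]

print(cosine_similarity(v_cat, v_dog))  # higher: related words
print(cosine_similarity(v_cat, v_car))  # lower: unrelated words
```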
---
layout: model
title: Abkhazian asr_wav2vec2_common_voice_ab_demo TFWav2Vec2ForCTC from patrickvonplaten
author: John Snow Labs
name: pipeline_asr_wav2vec2_common_voice_ab_demo
date: 2022-09-24
tags: [wav2vec2, ab, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: ab
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_common_voice_ab_demo` is an Abkhazian model originally trained by patrickvonplaten.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_common_voice_ab_demo_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_common_voice_ab_demo_ab_4.2.0_3.0_1664042317411.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_common_voice_ab_demo_ab_4.2.0_3.0_1664042317411.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_common_voice_ab_demo', lang = 'ab')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_common_voice_ab_demo", lang = "ab")
val annotations = pipeline.transform(audioDF)
```
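The pipeline's `transform` call expects a DataFrame whose audio column holds the recording as an array of floats (typically 16 kHz mono). A stdlib-only sketch of decoding 16-bit PCM WAV data into that shape; the synthetic tone is just a placeholder for real speech:

```python
import math
import struct
import wave

def write_tone(path, seconds=0.1, rate=16000, freq=440.0):
    # Write a short 16-bit mono sine tone as a stand-in for a real recording.
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        n = int(seconds * rate)
        samples = (int(32767 * math.sin(2 * math.pi * freq * i / rate)) for i in range(n))
        w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

def wav_to_floats(path):
    # Decode 16-bit PCM frames into floats in [-1.0, 1.0],
    # the form of input ASR pipelines generally consume.
    with wave.open(path, "rb") as w:
        raw = w.readframes(w.getnframes())
    return [s / 32768.0 for (s,) in struct.iter_unpack("<h", raw)]

write_tone("tone.wav")
floats = wav_to_floats("tone.wav")
print(len(floats), min(floats), max(floats))
```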
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_common_voice_ab_demo|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|ab|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English DistilBertForQuestionAnswering Base Cased model
author: John Snow Labs
name: distilbert_qa_base_cased_led_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad` is an English model originally trained by HuggingFace.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_en_4.3.0_3.0_1672766495924.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_en_4.3.0_3.0_1672766495924.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_cased_led_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/distilbert-base-cased-distilled-squad
- https://arxiv.org/abs/1910.01108
- https://arxiv.org/abs/1910.01108
- https://aclanthology.org/2021.acl-long.330.pdf
- https://dl.acm.org/doi/pdf/10.1145/3442188.3445922
- https://yknzhu.wixsite.com/mbweb
- https://en.wikipedia.org/wiki/English_Wikipedia
- https://mlco2.github.io/impact#compute
- https://arxiv.org/abs/1910.09700
- https://arxiv.org/pdf/1910.01108.pdf
- https://arxiv.org/abs/1910.01108
- https://paperswithcode.com/sota?task=Question+Answering&dataset=squad
---
layout: model
title: Visual NER on 10K Filings (SEC)
author: John Snow Labs
name: visualner_10kfilings
date: 2022-09-21
tags: [en, licensed]
task: OCR Object Detection
language: en
nav_key: models
edition: Visual NLP 4.0.0
spark_version: 3.2
supported: true
annotator: VisualDocumentNERv21
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a Visual NER model aimed at extracting the main key points from the summary page of SEC 10-K filings (annual reports).
## Predicted Entities
`REGISTRANT`, `ADDRESS`, `PHONE`, `DATE`, `EMPLOYERIDNB`, `EXCHANGE`, `STATE`, `STOCKCLASS`, `STOCKVALUE`, `TRADINGSYMBOL`, `FILENUMBER`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/visualner_10kfilings_en_4.0.0_3.2_1663769328577.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/visualner_10kfilings_en_4.0.0_3.2_1663769328577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
|filename|exploded_entities |
+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
|t01.jpg |{named_entity, 712, 716, OTHERS, {confidence -> 96, width -> 74, x -> 1557, y -> 416, word -> Ended, token -> ended, height -> 18}, []} |
|t01.jpg |{named_entity, 718, 724, DATE-B, {confidence -> 96, width -> 97, x -> 1639, y -> 416, word -> January, token -> january, height -> 24}, []} |
|t01.jpg |{named_entity, 726, 727, DATE-I, {confidence -> 95, width -> 34, x -> 1743, y -> 416, word -> 31,, token -> 31, height -> 22}, []} |
|t01.jpg |{named_entity, 730, 733, DATE-I, {confidence -> 96, width -> 54, x -> 1785, y -> 416, word -> 2021, token -> 2021, height -> 18}, []} |
|t01.jpg |{named_entity, 735, 744, OTHERS, {confidence -> 91, width -> 143, x -> 1372, y -> 472, word -> Commission, token -> commission, height -> 18}, []} |
|t01.jpg |{named_entity, 746, 749, OTHERS, {confidence -> 96, width -> 36, x -> 1523, y -> 472, word -> file, token -> file, height -> 18}, []} |
|t01.jpg |{named_entity, 751, 756, OTHERS, {confidence -> 92, width -> 96, x -> 1568, y -> 472, word -> number:, token -> number, height -> 18}, []} |
|t01.jpg |{named_entity, 759, 761, FILENUMBER-B, {confidence -> 92, width -> 119, x -> 1675, y -> 472, word -> 001-39495, token -> 001, height -> 18}, []} |
|t01.jpg |{named_entity, 769, 773, REGISTRANT-B, {confidence -> 92, width -> 136, x -> 1472, y -> 558, word -> ASANA,, token -> asana, height -> 31}, []} |
|t01.jpg |{named_entity, 776, 778, REGISTRANT-I, {confidence -> 95, width -> 72, x -> 1620, y -> 558, word -> INC., token -> inc, height -> 25}, []}                        |
+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
```
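The `exploded_entities` rows above carry IOB-style tags (`DATE-B`, `DATE-I`, `OTHERS`, …). A hypothetical post-processing sketch for merging such token-level tags back into entity chunks:

```python
def merge_iob(tokens):
    # tokens: list of (word, tag) pairs, tag being "LABEL-B", "LABEL-I", or "OTHERS".
    chunks = []
    current_label, current_words = None, []
    for word, tag in tokens:
        if tag.endswith("-B"):
            # A B- tag always starts a new chunk, closing any open one.
            if current_label:
                chunks.append((current_label, " ".join(current_words)))
            current_label, current_words = tag[:-2], [word]
        elif tag.endswith("-I") and current_label == tag[:-2]:
            current_words.append(word)
        else:
            # OTHERS (or a dangling I- tag) ends any open chunk.
            if current_label:
                chunks.append((current_label, " ".join(current_words)))
            current_label, current_words = None, []
    if current_label:
        chunks.append((current_label, " ".join(current_words)))
    return chunks

# Token/tag pairs taken from the sample output above.
tokens = [
    ("Ended", "OTHERS"), ("January", "DATE-B"), ("31,", "DATE-I"), ("2021", "DATE-I"),
    ("Commission", "OTHERS"), ("001-39495", "FILENUMBER-B"),
    ("ASANA,", "REGISTRANT-B"), ("INC.", "REGISTRANT-I"),
]
print(merge_iob(tokens))
# [('DATE', 'January 31, 2021'), ('FILENUMBER', '001-39495'), ('REGISTRANT', 'ASANA, INC.')]
```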
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|visualner_10kfilings|
|Type:|ocr|
|Compatibility:|Visual NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|744.4 MB|
## References
SEC 10k filings
---
layout: model
title: Detect Clinical Entities (ner_jsl_enriched)
author: John Snow Labs
name: ner_jsl_enriched
date: 2021-10-22
tags: [ner, licensed, clinical, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.3.0
spark_version: 3.0
supported: true
recommended: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for clinical terminology. This model is capable of predicting up to `87` different entities and is based on `ner_jsl`.
Definitions of Predicted Entities:
- `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else.
- `Direction`: All the information relating to the laterality of the internal and external organs.
- `Test`: Mentions of laboratory, pathology, and radiological tests.
- `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient.
- `Death_Entity`: Mentions that indicate the death of a patient.
- `Relationship_Status`: State of the patient's romantic or social relationships (e.g. single, married, divorced).
- `Duration`: The duration of a medical treatment or medication use.
- `Respiration`: Number of breaths per minute.
- `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms.
- `Birth_Entity`: Mentions that indicate giving birth.
- `Age`: All mentions of ages, past or present, related to the patient or anybody else.
- `Labour_Delivery`: Extractions include stages of labor and delivery.
- `Family_History_Header`: Identifies section headers that correspond to the Family History of the patient.
- `BMI`: Numeric values and other text information related to Body Mass Index.
- `Temperature`: All mentions that refer to body temperature.
- `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else.
- `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic").
- `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else.
- `Medical_History_Header`: Identifies section headers that correspond to Past Medical History of a patient.
- `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events.
- `Oxygen_Therapy`: Breathing support triggered by patient or entirely or partially by machine (e.g. ventilator, BPAP, CPAP).
- `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements.
- `Psychological_Condition`: All mental health diagnoses, disorders, conditions or syndromes of a patient or someone else.
- `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases.
- `Employment`: All mentions of patient or provider occupational titles and employment status.
- `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels).
- `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.).
- `Pregnancy`: All terms related to Pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.).
- `ImagingFindings`: All mentions of radiographic and imaging findings.
- `Procedure`: All mentions of invasive medical or surgical procedures or treatments.
- `Medical_Device`: All mentions related to medical devices and supplies.
- `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups.
- `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels).
- `Symptom`: All the symptoms mentioned in the document, of a patient or someone else.
- `Treatment`: Includes therapeutic and minimally invasive treatment and procedures (invasive treatments or procedures are extracted as "Procedure").
- `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs).
- `Route`: Drug and medication administration routes available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Drug_Ingredient`: Active ingredient/s found in drug products.
- `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted.
- `Diet`: All mentions and information regarding the patient's dietary habits.
- `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye.
- `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein).
- `VS_Finding`: Qualitative data (e.g. Fever, Cyanosis, Tachycardia) and any other symptoms that refers to vital signs.
- `Allergen`: Allergen related extractions mentioned in the document.
- `EKG_Findings`: All mentions of EKG readings.
- `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology.
- `Triglycerides`: All terms related to the specific lab test for Triglycerides.
- `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning").
- `Gender`: Gender-specific nouns and pronouns.
- `Pulse`: Peripheral heart rate, without advanced information like measurement location.
- `Social_History_Header`: Identifies section headers that correspond to Social History of a patient.
- `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs).
- `Diabetes`: All terms related to diabetes mellitus.
- `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in ICD-10 name of a specific disease, the respective modifier is not extracted separately.
- `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined by the naked eye.
- `Clinical_Dept`: Terms that indicate the medical and/or surgical departments.
- `Form`: Drug and medication forms available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients.
- `Strength`: Potency of one unit of drug (or a combination of drugs); the measurement units available are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Fetus_NewBorn`: All terms related to fetus, infant, new born (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.).
- `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago").
- `Height`: All mentions related to a patient's height.
- `Test_Result`: Terms related to all the test results present in the document (clinical tests results are included).
- `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity.
- `Frequency`: Frequency of administration for a dose prescribed.
- `Time`: Specific time references (hour and/or minutes).
- `Weight`: All mentions related to a patient's weight.
- `Vaccine`: Generic and brand name of vaccines or vaccination procedure.
- `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient.
- `Communicable_Disease`: Includes all mentions of communicable diseases.
- `Dosage`: Quantity prescribed by the physician for an active ingredient; the measurement units available are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately).
- `Hypertension`: All terms related to Hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure).
- `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein).
- `Total_Cholesterol`: Terms related to the lab test and results for cholesterol.
- `Smoking`: All mentions of smoking status of a patient.
- `Date`: Mentions of an exact date, in any format, including day number, month and/or year.
## Predicted Entities
`Social_History_Header`, `Oncology_Therapy`, `Blood_Pressure`, `Respiration`, `Performance_Status`, `Family_History_Header`, `Dosage`, `Clinical_Dept`, `Diet`, `Procedure`, `HDL`, `Weight`, `Admission_Discharge`, `LDL`, `Kidney_Disease`, `Oncological`, `Route`, `Imaging_Technique`, `Puerperium`, `Overweight`, `Temperature`, `Diabetes`, `Vaccine`, `Age`, `Test_Result`, `Employment`, `Time`, `Obesity`, `EKG_Findings`, `Pregnancy`, `Communicable_Disease`, `BMI`, `Strength`, `Tumor_Finding`, `Section_Header`, `RelativeDate`, `ImagingFindings`, `Death_Entity`, `Date`, `Cerebrovascular_Disease`, `Treatment`, `Labour_Delivery`, `Pregnancy_Delivery_Puerperium`, `Direction`, `Internal_organ_or_component`, `Psychological_Condition`, `Form`, `Medical_Device`, `Test`, `Symptom`, `Disease_Syndrome_Disorder`, `Staging`, `Birth_Entity`, `Hyperlipidemia`, `O2_Saturation`, `Frequency`, `External_body_part_or_region`, `Drug_Ingredient`, `Vital_Signs_Header`, `Substance_Quantity`, `Race_Ethnicity`, `VS_Finding`, `Injury_or_Poisoning`, `Medical_History_Header`, `Alcohol`, `Triglycerides`, `Total_Cholesterol`, `Sexually_Active_or_Sexual_Orientation`, `Female_Reproductive_Status`, `Relationship_Status`, `Drug_BrandName`, `RelativeTime`, `Duration`, `Hypertension`, `Metastasis`, `Gender`, `Oxygen_Therapy`, `Pulse`, `Heart_Disease`, `Modifier`, `Allergen`, `Smoking`, `Substance`, `Cancer_Modifier`, `Fetus_NewBorn`, `Height`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_en_3.3.0_3.0_1634865045033.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_en_3.3.0_3.0_1634865045033.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
jsl_ner = MedicalNerModel.pretrained("ner_jsl_enriched", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("jsl_ner")
jsl_ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "jsl_ner"]) \
.setOutputCol("ner_chunk")
jsl_ner_pipeline = Pipeline().setStages([
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
jsl_ner,
jsl_ner_converter])
jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
data = spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."]]).toDF("text")
result = jsl_ner_model.transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val jsl_ner = MedicalNerModel.pretrained("ner_jsl_enriched", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("jsl_ner")
val jsl_ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "jsl_ner"))
.setOutputCol("ner_chunk")
val jsl_ner_pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
jsl_ner,
jsl_ner_converter))
val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text")
val result = jsl_ner_pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.jsl.enriched").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
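Downstream, the chunks in the `ner_chunk` column are often collected and grouped by entity label. A hypothetical post-processing sketch over simulated chunk output (real chunks would be collected from `result.select("ner_chunk")`):

```python
from collections import defaultdict

def group_chunks(chunks):
    # chunks: list of (text, label) pairs, as produced by the NerConverter stage.
    grouped = defaultdict(list)
    for text, label in chunks:
        grouped[label].append(text)
    return dict(grouped)

# Simulated chunks for the example document above (illustrative only).
chunks = [
    ("21-day-old", "Age"), ("Caucasian", "Race_Ethnicity"), ("male", "Gender"),
    ("congestion", "Symptom"), ("mom", "Gender"), ("Tylenol", "Drug_BrandName"),
]
print(group_chunks(chunks))
```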
## Results
```bash
+---------+
|   result|
+---------+
|[set-off]|
|  [other]|
|  [other]|
|[set-off]|
+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_set_off_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scraped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.95 0.98 0.96 94
set-off 0.94 0.86 0.90 36
accuracy - - 0.95 130
macro-avg 0.94 0.92 0.93 130
weighted-avg 0.95 0.95 0.95 130
```
---
layout: model
title: English RobertaForQuestionAnswering (from vuiseng9)
author: John Snow Labs
name: roberta_qa_roberta_l_squadv1.1
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-l-squadv1.1` is an English model originally trained by `vuiseng9`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_l_squadv1.1_en_4.0.0_3.0_1655735988687.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_l_squadv1.1_en_4.0.0_3.0_1655735988687.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_l_squadv1.1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_l_squadv1.1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
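Under the hood, extractive QA heads like this one score each context token as a possible answer start and end; the returned answer is the span maximizing the combined score with start ≤ end. A simplified pure-Python sketch with made-up scores (not actual model outputs):

```python
def best_span(start_scores, end_scores, max_len=15):
    # Pick (i, j) maximizing start_scores[i] + end_scores[j], with i <= j
    # and a bounded span length, as extractive QA heads typically do.
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best[2]:
                best = (i, j, s + end_scores[j])
    return best[0], best[1]

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Made-up scores peaking at "Clara" for the question "What's my name?".
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.0, 0.1, 0.0, 1.0, 0.0]
end   = [0.0, 0.1, 0.1, 4.8, 0.1, 0.0, 0.0, 0.1, 1.2, 0.0]
i, j = best_span(start, end)
print(" ".join(context[i:j + 1]))  # Clara
```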
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_l_squadv1.1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/vuiseng9/roberta-l-squadv1.1
---
layout: model
title: Legal Advice Class Identifier
author: John Snow Labs
name: legclf_reddit_advice
date: 2023-03-10
tags: [en, licensed, legal, classifier, reddit, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
recommended: true
engine: tensorflow
annotator: MedicalBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a multiclass classification model that identifies the topic/class of an informal message from a legal forum, covering the following classes: `digital`, `business`, `insurance`, `contract`, `driving`, `school`, `family`, `wills`, `employment`, `housing`, `criminal`.
## Predicted Entities
`digital`, `business`, `insurance`, `contract`, `driving`, `school`, `family`, `wills`, `employment`, `housing`, `criminal`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_reddit_advice_en_1.0.0_3.0_1678448985639.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_reddit_advice_en_1.0.0_3.0_1678448985639.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
seq_classifier = legal.BertForSequenceClassification.pretrained("legclf_reddit_advice", "en", "legal/models") \
.setInputCols(["document", "token"]) \
.setOutputCol("class")
pipeline = nlp.Pipeline(stages=[documentAssembler, tokenizer, seq_classifier])
data = spark.createDataFrame([["Mother of my child took my daughter and moved (without notice), won't let me see her or tell me where she is."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
## Results
```bash
+--------+
| result|
+--------+
|[family]|
+--------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_reddit_advice|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
Training dataset available [here](https://huggingface.co/datasets/jonathanli/legal-advice-reddit)
## Benchmarking
```bash
label precision recall f1-score support
business 0.76 0.67 0.72 239
contract 0.80 0.68 0.73 207
criminal 0.82 0.77 0.80 209
digital 0.76 0.74 0.75 223
driving 0.86 0.85 0.86 223
employment 0.76 0.92 0.83 222
family 0.88 0.95 0.92 216
housing 0.89 0.95 0.92 221
insurance 0.83 0.80 0.81 221
school 0.87 0.91 0.89 207
wills 0.95 0.96 0.96 199
accuracy - - 0.83 2387
macro-avg 0.84 0.84 0.83 2387
weighted-avg 0.83 0.83 0.83 2387
```
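As a quick sanity check, the weighted-avg row above can be recomputed from the per-label F1 scores and supports (values copied from the table; plain-Python sketch, not part of the model pipeline):

```python
# Per-label F1 scores and supports, copied from the benchmarking table above
# (order: business, contract, criminal, digital, driving, employment,
#  family, housing, insurance, school, wills).
f1_scores = [0.72, 0.73, 0.80, 0.75, 0.86, 0.83, 0.92, 0.92, 0.81, 0.89, 0.96]
supports = [239, 207, 209, 223, 223, 222, 216, 221, 221, 207, 199]

def weighted_avg(scores, counts):
    """Support-weighted mean, as used for the weighted-avg row."""
    return sum(s * n for s, n in zip(scores, counts)) / sum(counts)

print(sum(supports))                              # 2387, the reported total support
print(round(weighted_avg(f1_scores, supports), 2))  # 0.83, the reported weighted-avg F1
```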
---
layout: model
title: Extract Temporal Entities from Voice of the Patient Documents (embeddings_clinical)
author: John Snow Labs
name: ner_vop_temporal
date: 2023-06-06
tags: [clinical, licensed, ner, en, vop, temporal]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts temporal references (`DateTime`, `Duration`, `Frequency`) from health-related text written in the patient's own words.
## Predicted Entities
`DateTime`, `Duration`, `Frequency`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_temporal_en_4.4.3_3.0_1686076127059.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_temporal_en_4.4.3_3.0_1686076127059.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_vop_temporal", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["I broke my arm playing football last month and had to get surgery in the orthopedic department. The cast just came off yesterday and I'm excited to start physical therapy and get back to the game."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_vop_temporal", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("I broke my arm playing football last month and had to get surgery in the orthopedic department. The cast just came off yesterday and I'm excited to start physical therapy and get back to the game.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| chunk | ner_label |
|:-----------|:------------|
| last month | DateTime |
| yesterday | DateTime |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_temporal|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.8 MB|
|Dependencies:|embeddings_clinical|
## References
In-house annotated health-related text in colloquial language.
## Sample text from the training dataset
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
## Benchmarking
```bash
label tp fp fn total precision recall f1
DateTime 4056 655 346 4402 0.86 0.92 0.89
Duration 2008 371 302 2310 0.84 0.87 0.86
Frequency 879 157 200 1079 0.85 0.81 0.83
macro_avg 6943 1183 848 7791 0.85 0.87 0.86
micro_avg 6943 1183 848 7791 0.85 0.89 0.87
```
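The precision, recall, and F1 columns above follow directly from the tp/fp/fn counts; a quick stdlib check (counts copied from the table rows):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts, rounded to 2 decimals as in the table."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

print(prf(4056, 655, 346))  # DateTime row -> (0.86, 0.92, 0.89)
print(prf(2008, 371, 302))  # Duration row -> (0.84, 0.87, 0.86)
```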
---
layout: model
title: Slovak RobertaForMaskedLM Cased model (from fav-kky)
author: John Snow Labs
name: roberta_embeddings_fernet_news
date: 2022-12-12
tags: [sk, open_source, roberta_embeddings, robertaformaskedlm]
task: Embeddings
language: sk
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `FERNET-News_sk` is a Slovak model originally trained by `fav-kky`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_fernet_news_sk_4.2.4_3.0_1670858429673.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_fernet_news_sk_4.2.4_3.0_1670858429673.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_fernet_news","sk") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_fernet_news","sk")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_fernet_news|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|sk|
|Size:|467.3 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/fav-kky/FERNET-News_sk
- https://arxiv.org/abs/2107.10042
---
layout: model
title: Clinical English Bert Embeddings (Base, 128 dimension)
author: John Snow Labs
name: bert_embeddings_clinical_pubmed_bert_base_128
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `clinical-pubmed-bert-base-128` is an English model originally trained by `Tsubasaz`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_clinical_pubmed_bert_base_128_en_3.4.2_3.0_1649672767031.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_clinical_pubmed_bert_base_128_en_3.4.2_3.0_1649672767031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_clinical_pubmed_bert_base_128","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_clinical_pubmed_bert_base_128","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.clinical_pubmed_bert_base_128").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_clinical_pubmed_bert_base_128|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|410.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Tsubasaz/clinical-pubmed-bert-base-128
- https://mimic.physionet.org/
---
layout: model
title: Korean ElectraForQuestionAnswering model (from monologg) Version-3
author: John Snow Labs
name: electra_qa_base_v3_finetuned_korquad
date: 2022-06-22
tags: [ko, open_source, electra, question_answering]
task: Question Answering
language: ko
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `koelectra-base-v3-finetuned-korquad` is a Korean model originally trained by `monologg`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_v3_finetuned_korquad_ko_4.0.0_3.0_1655922227904.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_v3_finetuned_korquad_ko_4.0.0_3.0_1655922227904.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_v3_finetuned_korquad","ko") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_v3_finetuned_korquad","ko")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ko.answer_question.korquad.electra.base").predict("""내 이름은 무엇입니까?|||제 이름은 클라라이고 저는 버클리에 살고 있습니다.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_base_v3_finetuned_korquad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ko|
|Size:|419.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/monologg/koelectra-base-v3-finetuned-korquad
---
layout: model
title: Sentence Entity Resolver for UMLS CUI Codes
author: John Snow Labs
name: sbiobertresolve_umls_findings
date: 2021-04-30
tags: [en, clinical, licensed, entity_resolution]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.2
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Map clinical entities to UMLS CUI codes.
## Predicted Entities
This model returns CUI (concept unique identifier) codes for 200K concepts from clinical findings.
https://www.nlm.nih.gov/research/umls/index.html
{:.btn-box}
[Live Demo](http://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_findings_en_3.0.2_3.0_1619774838339.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_findings_en_3.0.2_3.0_1619774838339.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
resolver = SentenceEntityResolverModel \
.pretrained("sbiobertresolve_umls_findings","en", "clinical/models") \
.setInputCols(["ner_chunk", "sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver])
data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""]]).toDF("text")
results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.umls.findings").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""")
```
## Results
```bash
| | ner_chunk | cui_code |
|---:|:--------------------------------------|:-----------|
| 0 | gestational diabetes mellitus | C2183115 |
| 1 | subsequent type two diabetes mellitus | C3532488 |
| 2 | T2DM | C3280267 |
| 3 | HTG-induced pancreatitis | C4554179 |
| 4 | an acute hepatitis | C4750596 |
| 5 | obesity | C1963185 |
| 6 | a body mass index | C0578022 |
| 7 | polyuria | C3278312 |
| 8 | polydipsia | C3278316 |
| 9 | poor appetite | C0541799 |
| 10 | vomiting | C0042963 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_umls_findings|
|Compatibility:|Healthcare NLP 3.0.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[umls_code]|
|Language:|en|
## Data Source
https://www.nlm.nih.gov/research/umls/index.html
---
layout: model
title: English BertForQuestionAnswering model (from Rocketknight1)
author: John Snow Labs
name: bert_qa_bert_finetuned_qa
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-qa` is an English model originally trained by `Rocketknight1`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_qa_en_4.0.0_3.0_1654535355191.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_qa_en_4.0.0_3.0_1654535355191.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_finetuned_qa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_finetuned_qa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bert.by_Rocketknight1").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
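The NLU one-liner above packs the question and context into a single string separated by `|||`. A minimal sketch of that packing convention (the helper names are illustrative, not part of the nlu API):

```python
def pack_qa(question: str, context: str) -> str:
    """Join question and context with the '|||' separator used in the nlu snippet above."""
    return f"{question}|||{context}"

def unpack_qa(packed: str) -> tuple:
    """Split a packed string back into (question, context); only the first '|||' delimits."""
    question, context = packed.split("|||", 1)
    return question, context

packed = pack_qa("What's my name?", "My name is Clara and I live in Berkeley.")
print(unpack_qa(packed))  # round-trips to the original (question, context) pair
```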
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_finetuned_qa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Rocketknight1/bert-finetuned-qa
---
layout: model
title: Portuguese DistilBertForQuestionAnswering Cased model (from mrm8488)
author: John Snow Labs
name: distilbert_qa_finetuned_squad
date: 2022-07-21
tags: [open_source, distilbert, question_answering, pt]
task: Question Answering
language: pt
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBERT Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-multi-finedtuned-squad-pt` is a Portuguese model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_squad_pt_4.0.0_3.0_1658401584866.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_squad_pt_4.0.0_3.0_1658401584866.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned_squad","pt") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["PUT YOUR 'QUESTION' STRING HERE?", "PUT YOUR 'CONTEXT' STRING HERE"]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned_squad","pt")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("PUT YOUR 'QUESTION' STRING HERE?", "PUT YOUR 'CONTEXT' STRING HERE")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|pt|
|Size:|505.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
https://huggingface.co/mrm8488/distilbert-multi-finedtuned-squad-pt
---
layout: model
title: Part of Speech for Japanese
author: John Snow Labs
name: pos_ud_gsd
date: 2021-01-03
task: Part of Speech Tagging
language: ja
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [pos, ja, open_source]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 13 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_ja_2.7.0_2.4_1609700150824.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_ja_2.7.0_2.4_1609700150824.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja")\
.setInputCols(["sentence"])\
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ud_gsd", "ja") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
word_segmenter,
pos
])
example = spark.createDataFrame([['院長と話をしたところ、腰痛治療も得意なようです。']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja")
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_gsd", "ja")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, pos))
val data = Seq("院長と話をしたところ、腰痛治療も得意なようです。").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""5月13日に放送されるフジテレビ系「僕らの音楽」にて、福原美穂とAIという豪華共演が決定した。"""]
pos_df = nlu.load('ja.pos.ud_gsd').predict(text, output_level='token')
pos_df
```
## Results
```bash
+------+-----+
|token |pos |
+------+-----+
|院長 |NOUN |
|と |ADP |
|話 |NOUN |
|を |ADP |
|し |VERB |
|た |AUX |
|ところ|NOUN |
|、 |PUNCT|
|腰痛 |NOUN |
|治療 |NOUN |
|も |ADP |
|得意 |ADJ |
|な |AUX |
|よう |AUX |
|です |AUX |
|。 |PUNCT|
+------+-----+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_gsd|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[pos]|
|Language:|ja|
## Data Source
The model was trained on the [Universal Dependencies](https://universaldependencies.org/), curated by Google.
Reference:
> Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018).
Universal Dependencies Version 2 for Japanese. In LREC-2018.
## Benchmarking
```bash
| pos_tag | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| ADJ | 0.90 | 0.78 | 0.84 | 350 |
| ADP | 0.98 | 0.99 | 0.99 | 2804 |
| ADV | 0.87 | 0.65 | 0.74 | 220 |
| AUX | 0.95 | 0.98 | 0.96 | 1768 |
| CCONJ | 0.97 | 0.93 | 0.95 | 42 |
| DET | 1.00 | 1.00 | 1.00 | 66 |
| INTJ | 0.00 | 0.00 | 0.00 | 1 |
| NOUN | 0.93 | 0.98 | 0.95 | 3692 |
| NUM | 0.99 | 0.98 | 0.99 | 251 |
| PART | 0.96 | 0.83 | 0.89 | 128 |
| PRON | 0.97 | 0.94 | 0.95 | 101 |
| PROPN | 0.92 | 0.70 | 0.79 | 313 |
| PUNCT | 1.00 | 1.00 | 1.00 | 1294 |
| SCONJ | 0.97 | 0.94 | 0.96 | 682 |
| SYM | 0.99 | 1.00 | 0.99 | 67 |
| VERB | 0.96 | 0.92 | 0.94 | 1255 |
| accuracy     |           |        | 0.96     | 13034   |
| macro avg | 0.90 | 0.85 | 0.87 | 13034 |
| weighted avg | 0.96 | 0.96 | 0.95 | 13034 |
```
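Two quick consistency checks on the table above: the per-tag supports sum to the reported total, and the macro average is the unweighted mean of the per-tag scores (values copied from the precision and support columns; plain-Python sketch):

```python
# Per-tag precision and support, copied from the benchmarking table above
# (order: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM,
#  PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB).
precisions = [0.90, 0.98, 0.87, 0.95, 0.97, 1.00, 0.00, 0.93, 0.99,
              0.96, 0.97, 0.92, 1.00, 0.97, 0.99, 0.96]
supports = [350, 2804, 220, 1768, 42, 66, 1, 3692, 251,
            128, 101, 313, 1294, 682, 67, 1255]

print(sum(supports))  # 13034, the reported total support
# Unweighted mean of per-tag precisions; rounds to 0.9, the 0.90 macro-avg row.
print(round(sum(precisions) / len(precisions), 2))
```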
---
layout: model
title: Detect Person, Organization and Location in Turkish text
author: John Snow Labs
name: xlm_roberta_base_token_classifier_ner
date: 2021-12-02
tags: [xlm, roberta, ner, turkish, tr, open_source]
task: Named Entity Recognition
language: tr
edition: Spark NLP 3.3.2
spark_version: 2.4
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model was imported from Hugging Face. It is a fine-tuned version of `xlm-roberta-base` (a multilingual RoBERTa model), trained on a reviewed version of a well-known Turkish NER dataset (https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt).
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_token_classifier_ner_tr_3.3.2_2.4_1638447262808.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_token_classifier_ner_tr_3.3.2_2.4_1638447262808.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_base_token_classifier_ner", "tr")\
.setInputCols(["sentence",'token'])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = """Benim adım Cesur Yurttaş ve İstanbul'da yaşıyorum."""
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_base_token_classifier_ner", "tr")
.setInputCols(Array("sentence","token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))
val example = Seq("Benim adım Cesur Yurttaş ve İstanbul'da yaşıyorum.").toDS.toDF("text")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("tr.ner.xlm_roberta").predict("""Benim adım Cesur Yurttaş ve İstanbul'da yaşıyorum.""")
```
## Results
```bash
+-------------+---------+
|chunk |ner_label|
+-------------+---------+
|Cesur Yurttaş|PER |
|İstanbul'da |LOC |
+-------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_base_token_classifier_ner|
|Compatibility:|Spark NLP 3.3.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|tr|
|Case sensitive:|true|
|Max sentence length:|256|
## Data Source
[https://huggingface.co/akdeniz27/xlm-roberta-base-turkish-ner](https://huggingface.co/akdeniz27/xlm-roberta-base-turkish-ner)
## Benchmarking
```bash
accuracy: 0.9919343118732742
f1: 0.9492100796448622
precision: 0.9407349896480332
recall: 0.9578392621870883
```
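As a sanity check, the reported F1 is consistent with the harmonic mean of the reported precision and recall; a quick verification in plain Python:

```python
# F1 is the harmonic mean of precision and recall.
precision = 0.9407349896480332
recall = 0.9578392621870883

f1 = 2 * precision * recall / (precision + recall)
print(f1)  # agrees with the reported f1 above
```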
---
layout: model
title: Sentence Embeddings - sbert medium (tuned)
author: John Snow Labs
name: sbert_jsl_medium_uncased
date: 2021-06-30
tags: [embeddings, clinical, licensed, en]
task: Embeddings
language: en
nav_key: models
edition: Healthcare NLP 3.1.0
spark_version: 2.4
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained to generate contextual sentence embeddings of input sentences.
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_uncased_en_3.1.0_2.4_1625050209626.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_uncased_en_3.1.0_2.4_1625050209626.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models") \
.setInputCols(["sentence"]) \
.setOutputCol("sbert_embeddings")
```
```scala
val sbiobert_embeddings = BertSentenceEmbeddings
.pretrained("sbert_jsl_medium_uncased","en","clinical/models")
.setInputCols(Array("sentence"))
.setOutputCol("sbert_embeddings")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed_sentence.bert.jsl_medium_uncased").predict("""Put your text here.""")
```
## Results
```bash
Gives a 768 dimensional vector representation of the sentence.
```
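Downstream components such as entity resolvers compare these 768-dimensional vectors, most commonly via cosine similarity. A minimal, self-contained sketch (the vectors below are random stand-ins for real embeddings, not model output):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
v1 = rng.normal(size=768)  # stand-in for a 768-dim sentence embedding
v2 = rng.normal(size=768)

print(cosine_similarity(v1, v1))  # 1.0 for identical vectors
print(cosine_similarity(v1, v2))  # near 0 for unrelated random vectors
```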
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbert_jsl_medium_uncased|
|Compatibility:|Healthcare NLP 3.1.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Case sensitive:|false|
## Data Source
Tuned on MedNLI dataset
## Benchmarking
```bash
MedNLI Acc: 0.724, STS (cos): 0.743
```
---
layout: model
title: Arabic Bert Embeddings (MARBERT model)
author: John Snow Labs
name: bert_embeddings_MARBERT
date: 2022-04-11
tags: [bert, embeddings, ar, open_source]
task: Embeddings
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `MARBERT` is an Arabic model originally trained by `UBC-NLP`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_MARBERT_ar_3.4.2_3.0_1649677129277.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_MARBERT_ar_3.4.2_3.0_1649677129277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_MARBERT","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_MARBERT","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("أنا أحب شرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.embed.MARBERT").predict("""أنا أحب شرارة NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_MARBERT|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|611.5 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/UBC-NLP/MARBERT
- https://doi.org/10.14288/SOCKEYE
- https://www.tensorflow.org/tfrc
---
layout: model
title: Legal Effect of termination Clause Binary Classifier (md)
author: John Snow Labs
name: legclf_effect_of_termination_md
date: 2023-01-11
tags: [en, legal, classification, document, agreement, contract, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `effect-of-termination` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Keep in mind that this model's embeddings allow up to 512 tokens. If your texts are longer, consider splitting them into smaller pieces (the tutorial linked above also covers this).
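The paragraph-splitting approach above can be sketched without any Spark NLP dependency. A rough, hypothetical illustration of splitting by multiline breaks and capping each piece at a word budget (word count is only a crude proxy for the model's 512-token limit):

```python
def split_document(text, max_words=512):
    """Split a document into paragraph-based pieces of at most max_words words.

    Word count is a rough stand-in for the 512-token embedding limit.
    """
    pieces = []
    for paragraph in text.split("\n\n"):  # paragraph splitting by multiline
        words = paragraph.split()
        for i in range(0, len(words), max_words):
            pieces.append(" ".join(words[i:i + max_words]))
    return [p for p in pieces if p]

doc = "First clause text.\n\nSecond clause text with more words."
print(split_document(doc, max_words=4))
# → ['First clause text.', 'Second clause text with', 'more words.']
```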
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `effect-of-termination`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_effect_of_termination_md_en_1.0.0_3.0_1673460267892.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_effect_of_termination_md_en_1.0.0_3.0_1673460267892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
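This section is empty in the source. The sketch below follows the usage pattern of the other legal clause classifiers in this document; the sentence-embeddings model name (`sent_bert_base_cased`) and the `ClassifierDLModel` stage are assumptions, not confirmed by this card, so adjust them to the pipeline this model was actually trained with:

```python
# Hypothetical usage sketch; stage and model names follow sibling model cards.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols(["document"]) \
.setOutputCol("embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_effect_of_termination_md", "en", "legal/models") \
.setInputCols(["embeddings"]) \
.setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

data = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```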
## Results
```bash
+-------+
| result|
+-------+
|[effect-of-termination]|
|[other]|
|[other]|
|[effect-of-termination]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_effect_of_termination_md|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house + SEC documents
## Benchmarking
```bash
precision recall f1-score support
conditions-precedent 0.91 0.88 0.89 24
other 0.93 0.95 0.94 39
accuracy 0.92 63
macro avg 0.92 0.91 0.92 63
weighted avg 0.92 0.92 0.92 63
```
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from nlpconnect)
author: John Snow Labs
name: roberta_qa_dpr_nq_reader_base
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dpr-nq-reader-roberta-base` is an English model originally trained by `nlpconnect`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_dpr_nq_reader_base_en_4.3.0_3.0_1674210699630.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_dpr_nq_reader_base_en_4.3.0_3.0_1674210699630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_dpr_nq_reader_base","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_dpr_nq_reader_base","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_dpr_nq_reader_base|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|466.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/nlpconnect/dpr-nq-reader-roberta-base
---
layout: model
title: Translate English to Romanian Pipeline
author: John Snow Labs
name: translate_en_ro
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, ro, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `ro`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ro_xx_2.7.0_2.4_1609687586572.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ro_xx_2.7.0_2.4_1609687586572.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_ro", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_ro", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.ro').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_ro|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Arabic BertForQuestionAnswering model (from bhavikardeshna)
author: John Snow Labs
name: bert_qa_multilingual_bert_base_cased_arabic
date: 2022-06-02
tags: [open_source, question_answering, bert]
task: Question Answering
language: ar
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `multilingual-bert-base-cased-arabic` is an Arabic model originally trained by `bhavikardeshna`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_arabic_ar_4.0.0_3.0_1654188420264.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_arabic_ar_4.0.0_3.0_1654188420264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_multilingual_bert_base_cased_arabic","ar") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_multilingual_bert_base_cased_arabic","ar")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.answer_question.bert.multilingual_arabic_tuned_base_cased.by_bhavikardeshna").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_multilingual_bert_base_cased_arabic|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ar|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/bhavikardeshna/multilingual-bert-base-cased-arabic
---
layout: model
title: Translate English to Lingala Pipeline
author: John Snow Labs
name: translate_en_ln
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, ln, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `ln`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ln_xx_2.7.0_2.4_1609698946668.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ln_xx_2.7.0_2.4_1609698946668.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_ln", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_ln", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.ln').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_ln|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForQuestionAnswering model (from nlpunibo)
author: John Snow Labs
name: bert_qa_bert
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert` is an English model originally trained by `nlpunibo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_en_4.0.0_3.0_1654179427441.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_en_4.0.0_3.0_1654179427441.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bert.by_nlpunibo").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/nlpunibo/bert
---
layout: model
title: Legal Processed Agricultural Produce Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_processed_agricultural_produce_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, processed_agricultural_produce, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the legclf_processed_agricultural_produce_bert model, a Bert Sentence Embeddings Document Classifier, determines whether the document belongs to the Processed_Agricultural_Produce class or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`Processed_Agricultural_Produce`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_processed_agricultural_produce_bert_en_1.0.0_3.0_1678111667311.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_processed_agricultural_produce_bert_en_1.0.0_3.0_1678111667311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
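This section is empty in the source. A minimal sketch following the pattern of the other legal document classifiers on this page; the sentence-embeddings model name (`sent_bert_base_cased`) and the `ClassifierDLModel` stage are assumptions, so adjust them to the pipeline this model was actually trained with:

```python
# Hypothetical usage sketch; stage and model names follow sibling model cards.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_processed_agricultural_produce_bert", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

data = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```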
## Results
```bash
+-------+
|result|
+-------+
|[Processed_Agricultural_Produce]|
|[Other]|
|[Other]|
|[Processed_Agricultural_Produce]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_processed_agricultural_produce_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.7 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 0.90 0.92 0.91 421
Processed_Agricultural_Produce 0.93 0.91 0.92 487
accuracy - - 0.92 908
macro-avg 0.91 0.92 0.91 908
weighted-avg 0.92 0.92 0.92 908
```
---
layout: model
title: Portuguese asr_bp500_xlsr TFWav2Vec2ForCTC from lgris
author: John Snow Labs
name: asr_bp500_xlsr
date: 2022-09-26
tags: [wav2vec2, pt, audio, open_source, asr]
task: Automatic Speech Recognition
language: pt
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_bp500_xlsr` is a Portuguese model originally trained by lgris.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_bp500_xlsr_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_bp500_xlsr_pt_4.2.0_3.0_1664193982159.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_bp500_xlsr_pt_4.2.0_3.0_1664193982159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_bp500_xlsr", "pt")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_bp500_xlsr", "pt")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_bp500_xlsr|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|pt|
|Size:|756.2 MB|
---
layout: model
title: English DistilBertForQuestionAnswering Base Cased model (from nlpunibo)
author: John Snow Labs
name: distilbert_qa_base_config3
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert_base_config3` is an English model originally trained by `nlpunibo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config3_en_4.3.0_3.0_1672774482106.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config3_en_4.3.0_3.0_1672774482106.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config3","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_config3|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/nlpunibo/distilbert_base_config3
---
layout: model
title: Stance About Health Mandates Related to Covid-19 Classifier (BioBERT)
author: John Snow Labs
name: bert_sequence_classifier_health_mandates_stance_tweet
date: 2022-08-08
tags: [en, clinical, licensed, public_health, classifier, sequence_classification, covid_19, tweet, stance, mandate]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 4.0.2
spark_version: 3.0
supported: true
annotator: MedicalBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a [BioBERT based](https://github.com/dmis-lab/biobert) classifier that can classify stance about health mandates related to Covid-19 from tweets.
This model is intended for direct use as a classification model and the target classes are: Support, Disapproval, Not stated.
## Predicted Entities
`Support`, `Disapproval`, `Not stated`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_MANDATES/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_health_mandates_stance_tweet_en_4.0.2_3.0_1659982585130.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_health_mandates_stance_tweet_en_4.0.2_3.0_1659982585130.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mandates_stance_tweet", "en", "clinical/models")\
.setInputCols(["document", "token"])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
data = spark.createDataFrame(["""It's too dangerous to hold the RNC, but let's send students and teachers back to school.""",
"""So is the flu and pneumonia what are their s stop the Media Manipulation covid has treatments Youre Speaker Pelosi nephew so stop the agenda LIES.""",
"""Just a quick update to my U.S. followers, I'll be making a stop in all 50 states this spring! No tickets needed, just don't wash your hands, cough on each other.""",
"""Go to a restaurant no mask Do a food shop wear a mask INCONSISTENT No Masks No Masks.""",
"""But if schools close who is gonna occupy those graves Cause politiciansprotected smokers protected drunkardsprotected school kids amp teachers""",
"""New title Maskhole I think Im going to use this very soon coronavirus."""], StringType()).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("text", "class.result").show(truncate=False)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mandates_stance_tweet", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier))
val data = Seq("It's too dangerous to hold the RNC, but let's send students and teachers back to school",
"So is the flu and pneumonia what are their s stop the Media Manipulation covid has treatments Youre Speaker Pelosi nephew so stop the agenda LIES",
"Just a quick update to my U.S. followers, I'll be making a stop in all 50 states this spring! No tickets needed, just don't wash your hands, cough on each other",
"Go to a restaurant no mask Do a food shop wear a mask INCONSISTENT No Masks No Masks.",
"But if schools close who is gonna occupy those graves Cause politiciansprotected smokers protected drunkardsprotected school kids amp teachers",
"New title Maskhole I think Im going to use this very soon coronavirus.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.health_stance").predict("""Just a quick update to my U.S. followers, I'll be making a stop in all 50 states this spring! No tickets needed, just don't wash your hands, cough on each other.""")
```
## Results
```bash
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
|text |result |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
|It's too dangerous to hold the RNC, but let's send students and teachers back to school. |[Support] |
|So is the flu and pneumonia what are their s stop the Media Manipulation covid has treatments Youre Speaker Pelosi nephew so stop the agenda LIES. |[Disapproval]|
|Just a quick update to my U.S. followers, I'll be making a stop in all 50 states this spring! No tickets needed, just don't wash your hands, cough on each other.|[Not stated] |
|Go to a restaurant no mask Do a food shop wear a mask INCONSISTENT No Masks No Masks. |[Disapproval]|
|But if schools close who is gonna occupy those graves Cause politiciansprotected smokers protected drunkardsprotected school kids amp teachers |[Support] |
|New title Maskhole I think Im going to use this very soon coronavirus. |[Not stated] |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_health_mandates_stance_tweet|
|Compatibility:|Healthcare NLP 4.0.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
The dataset is Covid-19-specific and consists of tweets collected via a series of keywords associated with that disease.
## Benchmarking
```bash
label precision recall f1-score support
Disapproval 0.70 0.64 0.67 158
Not_stated 0.75 0.78 0.76 244
Support 0.73 0.74 0.74 197
accuracy - - 0.73 599
macro-avg 0.72 0.72 0.72 599
weighted-avg 0.73 0.73 0.73 599
```
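The weighted-avg row follows from the per-class scores and supports. A minimal sketch of that computation, with the values copied from the table above:

```python
# Per-class scores and supports, copied from the benchmark table
scores = {
    "Disapproval": {"precision": 0.70, "recall": 0.64, "f1": 0.67, "support": 158},
    "Not_stated":  {"precision": 0.75, "recall": 0.78, "f1": 0.76, "support": 244},
    "Support":     {"precision": 0.73, "recall": 0.74, "f1": 0.74, "support": 197},
}
total = sum(s["support"] for s in scores.values())  # 599

def weighted_avg(metric):
    """Support-weighted average of a per-class metric."""
    return sum(s[metric] * s["support"] for s in scores.values()) / total

print(round(weighted_avg("f1"), 2))  # 0.73, matching the weighted-avg row
```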
---
layout: model
title: Explain Document Pipeline for Polish
author: John Snow Labs
name: explain_document_sm
date: 2021-03-22
tags: [open_source, polish, explain_document_sm, pipeline, pl]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: pl
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The explain_document_sm is a pretrained pipeline that processes text with a few basic processing steps. It performs the most common text processing tasks (tokenization, lemmatization, part-of-speech tagging, and named entity recognition) on your dataframe.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_pl_3.0.0_3.0_1616423208721.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_sm_pl_3.0.0_3.0_1616423208721.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('explain_document_sm', lang = 'pl')
annotations = pipeline.fullAnnotate("Witaj z John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_sm", lang = "pl")
val result = pipeline.fullAnnotate("Witaj z John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Witaj z John Snow Labs! "]
result_df = nlu.load('pl.explain').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | lemma | pos | embeddings | ner | entities |
|---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
| 0 | ['Witaj z John Snow Labs! '] | ['Witaj z John Snow Labs!'] | ['Witaj', 'z', 'John', 'Snow', 'Labs!'] | ['witać', 'z', 'John', 'Snow', 'Labs!'] | ['VERB', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_document_sm|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|pl|
---
layout: model
title: French CamemBert Embeddings (from safik)
author: John Snow Labs
name: camembert_embeddings_safik_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `safik`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_safik_generic_model_fr_3.4.4_3.0_1653990191365.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_safik_generic_model_fr_3.4.4_3.0_1653990191365.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_safik_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_safik_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_safik_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/safik/dummy-model
---
layout: model
title: Extract Granular Anatomical Entities from Oncology Texts
author: John Snow Labs
name: ner_oncology_anatomy_granular
date: 2022-11-24
tags: [licensed, clinical, en, oncology, ner, anatomy]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts mentions of anatomical entities using granular labels.
## Predicted Entities
`Direction`, `Site_Lymph_Node`, `Site_Breast`, `Site_Other_Body_Part`, `Site_Bone`, `Site_Liver`, `Site_Lung`, `Site_Brain`
Definitions of Predicted Entities:
- `Direction`: Directional and laterality terms, such as "left", "right", "bilateral", "upper" and "lower".
- `Site_Bone`: Anatomical terms that refer to the human skeleton.
- `Site_Brain`: Anatomical terms that refer to the central nervous system (including the brain stem and the cerebellum).
- `Site_Breast`: Anatomical terms that refer to the breasts.
- `Site_Liver`: Anatomical terms that refer to the liver.
- `Site_Lung`: Anatomical terms that refer to the lungs.
- `Site_Lymph_Node`: Anatomical terms that refer to lymph nodes, excluding adenopathies.
- `Site_Other_Body_Part`: Relevant anatomical terms that are not included in the rest of the anatomical entities.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_granular_en_4.2.2_3.0_1669299394344.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_granular_en_4.2.2_3.0_1669299394344.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_anatomy_granular", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_anatomy_granular", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_anatomy_granular").predict("""The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.""")
```
## Results
```bash
| chunk | ner_label |
|:--------|:------------|
| left | Direction |
| breast | Site_Breast |
| lungs | Site_Lung |
| liver | Site_Liver |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_anatomy_granular|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|34.3 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Direction 822 221 162 984 0.79 0.84 0.81
Site_Lymph_Node 481 38 70 551 0.93 0.87 0.90
Site_Breast 88 14 59 147 0.86 0.60 0.71
Site_Other_Body_Part 604 184 897 1501 0.77 0.40 0.53
Site_Bone 252 74 61 313 0.77 0.81 0.79
Site_Liver 178 92 56 234 0.66 0.76 0.71
Site_Lung 398 98 161 559 0.80 0.71 0.75
Site_Brain 197 44 82 279 0.82 0.71 0.76
macro_avg 3020 765 1548 4568 0.80 0.71 0.74
micro_avg 3020 765 1548 4568 0.80 0.66 0.71
```
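The precision, recall and f1 columns above are derived from the tp, fp and fn counts in each row. A short sketch of that derivation, using the Direction row as an example:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# Direction row of the table: tp=822, fp=221, fn=162
print(prf(822, 221, 162))  # (0.79, 0.84, 0.81)
```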
---
layout: model
title: Finnish asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot TFWav2Vec2ForCTC from aapot
author: John Snow Labs
name: asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot
date: 2022-09-24
tags: [wav2vec2, fi, audio, open_source, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot` is a Finnish model originally trained by aapot.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot_fi_4.2.0_3.0_1664022293307.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot_fi_4.2.0_3.0_1664022293307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot", "fi")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot", "fi")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
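Both snippets above assume an existing `audioDf` DataFrame whose `audio_content` column holds floating-point audio samples. A minimal sketch of producing such samples from a 16-bit PCM mono WAV file using only the Python standard library (the file name `sample.wav` is a placeholder):

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV file and scale samples to [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("sample.wav")
# audioDf = spark.createDataFrame([[floats]], ["audio_content"])
```

Wav2Vec2 models expect 16 kHz mono input, so resample the audio beforehand if it was recorded at a different rate.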
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|fi|
|Size:|3.6 GB|
---
layout: model
title: Detect PHI for Deidentification purposes (Italian)
author: John Snow Labs
name: ner_deid_subentity
date: 2022-03-22
tags: [deid, it, licensed]
task: Named Entity Recognition
language: it
edition: Healthcare NLP 3.4.2
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Deidentification NER (Italian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 19 entities. This NER model is trained on an internally annotated custom dataset, on a COVID-19 Italian de-identification research dataset that makes up 15% of the total data [(Catelli et al.)](https://ieeexplore.ieee.org/document/9335570), and with several data augmentation mechanisms.
## Predicted Entities
`DATE`, `AGE`, `SEX`, `PROFESSION`, `ORGANIZATION`, `PHONE`, `EMAIL`, `ZIP`, `STREET`, `CITY`, `COUNTRY`, `PATIENT`, `DOCTOR`, `HOSPITAL`, `MEDICALRECORD`, `SSN`, `IDNUM`, `USERNAME`, `URL`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_it_3.4.2_3.0_1647983756765.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_it_3.4.2_3.0_1647983756765.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it")\
.setInputCols(["sentence", "token"])\
.setOutputCol("word_embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "it", "clinical/models")\
.setInputCols(["sentence","token", "word_embeddings"])\
.setOutputCol("ner")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner])
text = ["Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."]
data = spark.createDataFrame([text]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "it", "clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner))
val text = "Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."
val data = Seq(text).toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("it.med_ner.deid_subentity").predict("""Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.""")
```
## Results
```bash
+-------------+----------+
| token| ner_label|
+-------------+----------+
| Ho| O|
| visto| O|
| Gastone| B-PATIENT|
|Montanariello| I-PATIENT|
| (| O|
| 49| B-AGE|
| anni| O|
| )| O|
| riferito| O|
| all| O|
| '| O|
| Ospedale|B-HOSPITAL|
| San|I-HOSPITAL|
| Camillo|I-HOSPITAL|
| per| O|
| diabete| O|
| mal| O|
| controllato| O|
| con| O|
| sintomi| O|
| risalenti| O|
| a| O|
| marzo| B-DATE|
| 2015| I-DATE|
| .| O|
+-------------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_subentity|
|Compatibility:|Healthcare NLP 3.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|it|
|Size:|15.0 MB|
## References
- Internally annotated corpus
- [COVID-19 Italian de-identification dataset making up 15% of total data: R. Catelli, F. Gargiulo, V. Casola, G. De Pietro, H. Fujita and M. Esposito, "A Novel COVID-19 Data Set and an Effective Deep Learning Approach for the De-Identification of Italian Medical Records," in IEEE Access, vol. 9, pp. 19097-19110, 2021, doi: 10.1109/ACCESS.2021.3054479.](https://ieeexplore.ieee.org/document/9335570)
## Benchmarking
```bash
label tp fp fn total precision recall f1
PATIENT 263.0 29.0 25.0 288.0 0.9007 0.9132 0.9069
HOSPITAL 365.0 36.0 48.0 413.0 0.9102 0.8838 0.8968
DATE 1164.0 13.0 26.0 1190.0 0.989 0.9782 0.9835
ORGANIZATION 72.0 25.0 26.0 98.0 0.7423 0.7347 0.7385
URL 41.0 0.0 0.0 41.0 1.0 1.0 1.0
CITY 421.0 9.0 19.0 440.0 0.9791 0.9568 0.9678
STREET 198.0 4.0 6.0 204.0 0.9802 0.9706 0.9754
USERNAME 20.0 2.0 2.0 22.0 0.9091 0.9091 0.9091
SEX 753.0 26.0 21.0 774.0 0.9666 0.9729 0.9697
IDNUM 113.0 3.0 7.0 120.0 0.9741 0.9417 0.9576
EMAIL 148.0 0.0 0.0 148.0 1.0 1.0 1.0
ZIP 148.0 3.0 1.0 149.0 0.9801 0.9933 0.9867
MEDICALRECORD 19.0 3.0 6.0 25.0 0.8636 0.76 0.8085
SSN 13.0 1.0 1.0 14.0 0.9286 0.9286 0.9286
PROFESSION 316.0 28.0 53.0 369.0 0.9186 0.8564 0.8864
PHONE 53.0 0.0 2.0 55.0 1.0 0.9636 0.9815
COUNTRY 182.0 14.0 15.0 197.0 0.9286 0.9239 0.9262
DOCTOR 769.0 77.0 62.0 831.0 0.909 0.9254 0.9171
AGE 763.0 8.0 18.0 781.0 0.9896 0.977 0.9832
macro - - - - - - 0.9328
micro - - - - - - 0.9494
```
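The micro-averaged score pools tp, fp and fn across all labels before computing F1 (micro F1 = 2·tp / (2·tp + fp + fn)). A sketch reproducing the micro row from the per-label counts above; the tiny difference from the reported value likely comes from per-label rounding:

```python
# (tp, fp, fn) per label, copied from the benchmark table
counts = {
    "PATIENT": (263, 29, 25), "HOSPITAL": (365, 36, 48), "DATE": (1164, 13, 26),
    "ORGANIZATION": (72, 25, 26), "URL": (41, 0, 0), "CITY": (421, 9, 19),
    "STREET": (198, 4, 6), "USERNAME": (20, 2, 2), "SEX": (753, 26, 21),
    "IDNUM": (113, 3, 7), "EMAIL": (148, 0, 0), "ZIP": (148, 3, 1),
    "MEDICALRECORD": (19, 3, 6), "SSN": (13, 1, 1), "PROFESSION": (316, 28, 53),
    "PHONE": (53, 0, 2), "COUNTRY": (182, 14, 15), "DOCTOR": (769, 77, 62),
    "AGE": (763, 8, 18),
}
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = 2 * tp / (2 * tp + fp + fn)
print(micro_f1)  # close to the reported micro score of 0.9494
```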
---
layout: model
title: Translate Igbo to English Pipeline
author: John Snow Labs
name: translate_ig_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, ig, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `ig`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ig_en_xx_2.7.0_2.4_1609690747646.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ig_en_xx_2.7.0_2.4_1609690747646.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_ig_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_ig_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.ig.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_ig_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Translate English to Luba-Katanga Pipeline
author: John Snow Labs
name: translate_en_lu
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, lu, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `lu`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_lu_xx_2.7.0_2.4_1609701845697.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_lu_xx_2.7.0_2.4_1609701845697.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_lu", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_lu", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.lu').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_lu|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English Bert Embeddings (Base, Uncased, Agriculture)
author: John Snow Labs
name: bert_embeddings_agriculture_bert_uncased
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `agriculture-bert-uncased` is an English model originally trained by `recobo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_agriculture_bert_uncased_en_3.4.2_3.0_1649672401296.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_agriculture_bert_uncased_en_3.4.2_3.0_1649672401296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_agriculture_bert_uncased","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_agriculture_bert_uncased","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.agriculture_bert_uncased").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_agriculture_bert_uncased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|412.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/recobo/agriculture-bert-uncased
---
layout: model
title: Malay T5ForConditionalGeneration Tiny Cased model (from mesolitica)
author: John Snow Labs
name: t5_finetune_translation_tiny_standard_bahasa_cased
date: 2023-01-30
tags: [ms, open_source, t5, tensorflow]
task: Text Generation
language: ms
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finetune-translation-t5-tiny-standard-bahasa-cased` is a Malay model originally trained by `mesolitica`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_finetune_translation_tiny_standard_bahasa_cased_ms_4.3.0_3.0_1675102243431.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_finetune_translation_tiny_standard_bahasa_cased_ms_4.3.0_3.0_1675102243431.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_finetune_translation_tiny_standard_bahasa_cased","ms") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_finetune_translation_tiny_standard_bahasa_cased","ms")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_finetune_translation_tiny_standard_bahasa_cased|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|ms|
|Size:|176.9 MB|
## References
- https://huggingface.co/mesolitica/finetune-translation-t5-tiny-standard-bahasa-cased
- https://github.com/huseinzol05/malay-dataset/tree/master/translation/laser
- https://github.com/huseinzol05/malaya/tree/master/session/translation/hf-t5
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_6
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_6_en_4.0.0_3.0_1657184079074.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_6_en_4.0.0_3.0_1657184079074.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_6","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_6","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_6|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-6
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from Akshat)
author: John Snow Labs
name: xlmroberta_ner_akshat_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `Akshat`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_akshat_base_finetuned_panx_de_4.1.0_3.0_1660429170020.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_akshat_base_finetuned_panx_de_4.1.0_3.0_1660429170020.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_akshat_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_akshat_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_akshat_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Akshat/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: Pipeline to Detect Radiology Related Entities
author: John Snow Labs
name: ner_radiology_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_radiology](https://nlp.johnsnowlabs.com/2021/03/31/ner_radiology_en.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_RADIOLOGY/){:.button.button-orange.button-orange-trans.arr.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_RADIOLOGY.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_radiology_pipeline_en_3.4.1_3.0_1647874212591.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_radiology_pipeline_en_3.4.1_3.0_1647874212591.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_radiology_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("Breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_radiology_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("Breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.radiology.pipeline").predict("""Breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_exper7_mesum5", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_exper7_mesum5", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_exper7_mesum5|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|322.3 MB|
---
layout: model
title: Pipeline to Detect Anatomical References (biobert)
author: John Snow Labs
name: ner_anatomy_biobert_pipeline
date: 2023-03-20
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_anatomy_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_anatomy_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_biobert_pipeline_en_4.3.0_3.2_1679312126242.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_biobert_pipeline_en_4.3.0_3.2_1679312126242.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_anatomy_biobert_pipeline", "en", "clinical/models")
text = '''This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.
General: Well-developed female, in no acute distress, afebrile.
HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.
Neck: No lymphadenopathy.
Chest: Clear.
Abdomen: Positive bowel sounds and soft.
Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_anatomy_biobert_pipeline", "en", "clinical/models")
val text = "This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.
General: Well-developed female, in no acute distress, afebrile.
HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.
Neck: No lymphadenopathy.
Chest: Clear.
Abdomen: Positive bowel sounds and soft.
Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.anatomy_biobert.pipeline").predict("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.
General: Well-developed female, in no acute distress, afebrile.
HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.
Neck: No lymphadenopathy.
Chest: Clear.
Abdomen: Positive bowel sounds and soft.
Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical_bert","ro","clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter])
sample_text = """ Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min."""
data = spark.createDataFrame([[sample_text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical_bert", "ro", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter))
val data = Seq("""Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ro.embed.clinical.bert.base_cased").predict(""" Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""")
```
## Results
```bash
+--------------------------+-------------------------+
|chunks |entities |
+--------------------------+-------------------------+
|Angio CT cardio-toracic |Imaging_Test |
|Atrezie |Disease_Syndrome_Disorder|
|valva pulmonara |Body_Part |
|Hipoplazie |Disease_Syndrome_Disorder|
|VS |Body_Part |
|Atrezie |Disease_Syndrome_Disorder|
|VAV stang |Body_Part |
|Anastomoza Glenn |Disease_Syndrome_Disorder|
|Tromboza |Disease_Syndrome_Disorder|
|Sectia Clinica Cardiologie|Clinical_Dept |
|GE Revolution HD |Medical_Device |
|Branula albastra |Medical_Device |
|membrului superior drept |Body_Part |
|Scout |Body_Part |
|30 ml |Dosage |
|Iomeron 350 |Drug_Ingredient |
|2.2 ml/s |Dosage |
|20 ml |Dosage |
|ser fiziologic |Drug_Ingredient |
|angio-CT |Imaging_Test |
+--------------------------+-------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_clinical_bert|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ro|
|Size:|16.3 MB|
## Benchmarking
```bash
label precision recall f1-score support
Body_Part 0.91 0.93 0.92 679
Clinical_Dept 0.68 0.65 0.67 97
Date 0.99 0.99 0.99 87
Direction 0.66 0.76 0.70 50
Disease_Syndrome_Disorder 0.73 0.76 0.74 121
Dosage 0.78 1.00 0.87 38
Drug_Ingredient 0.90 0.94 0.92 48
Form 1.00 1.00 1.00 6
Imaging_Findings 0.86 0.82 0.84 201
Imaging_Technique 0.92 0.92 0.92 26
Imaging_Test 0.93 0.98 0.95 205
Measurements 0.71 0.69 0.70 214
Medical_Device 0.85 0.81 0.83 42
Pulse 0.82 1.00 0.90 9
Route 1.00 0.91 0.95 33
Score 1.00 0.98 0.99 41
Time 1.00 1.00 1.00 28
Units 0.60 0.93 0.73 88
Weight 0.82 1.00 0.90 9
micro-avg 0.84 0.87 0.86 2037
macro-avg 0.70 0.74 0.72 2037
weighted-avg 0.84 0.87 0.85 2037
```
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from IDL)
author: John Snow Labs
name: distilbert_qa_autotrain_qna_1170143354
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-qna-1170143354` is an English model originally trained by `IDL`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_autotrain_qna_1170143354_en_4.3.0_3.0_1672765675805.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_autotrain_qna_1170143354_en_4.3.0_3.0_1672765675805.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
question_answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_autotrain_qna_1170143354","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[document_assembler, question_answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val questionAnswering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_autotrain_qna_1170143354","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, questionAnswering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_autotrain_qna_1170143354|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/IDL/autotrain-qna-1170143354
---
layout: model
title: Legal Employment Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_employment_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, employment, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
The `legclf_employment_bert` model is a BERT sentence embeddings document classifier that determines whether a given document belongs to the `Employment` class or not (binary classification), according to EuroVoc labels.
## Predicted Entities
`Employment`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_employment_bert_en_1.0.0_3.0_1678111724799.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_employment_bert_en_1.0.0_3.0_1678111724799.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
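This card does not include a usage snippet, so the following is a minimal sketch based on the pattern used by other Legal NLP document classifiers in this collection. The `sent_bert_base_cased` embeddings model and the `legal.ClassifierDLModel` loader are assumptions inferred from the model's input label (`sentence_embeddings`); verify the exact embedding dependency in the Legal NLP documentation before use.

```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")
docClassifier = legal.ClassifierDLModel.pretrained("legclf_employment_bert", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```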
## Results
```bash
+------------+
|result      |
+------------+
|[Employment]|
|[Other]     |
|[Other]     |
|[Employment]|
+------------+
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_employment_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.6 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Employment 0.90 0.89 0.89 70
Other 0.87 0.89 0.88 61
accuracy - - 0.89 131
macro-avg 0.88 0.89 0.89 131
weighted-avg 0.89 0.89 0.89 131
```
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from khanglam7012)
author: John Snow Labs
name: t5_small
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small` is an English model originally trained by `khanglam7012`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_en_4.3.0_3.0_1675125819094.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_en_4.3.0_3.0_1675125819094.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_small","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_small","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_small|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|253.6 MB|
## References
- https://huggingface.co/khanglam7012/t5-small
- https://user-images.githubusercontent.com/49101362/116334480-f5e57a00-a7dd-11eb-987c-186477f94b6e.png
- https://pypi.org/project/keytotext/
- https://pepy.tech/project/keytotext
- https://colab.research.google.com/github/gagan3012/keytotext/blob/master/Examples/K2T.ipynb
- https://share.streamlit.io/gagan3012/keytotext/UI/app.py
- https://github.com/gagan3012/keytotext/tree/master/Training%20Notebooks
- https://github.com/gagan3012/keytotext/tree/master/Examples
- https://user-images.githubusercontent.com/49101362/116220679-90e64180-a755-11eb-9246-82d93d924a6c.png
- https://github.com/gagan3012/streamlit-tags
- https://user-images.githubusercontent.com/49101362/116162205-fc042980-a6fd-11eb-892e-8f6902f193f4.png
---
layout: model
title: Part of Speech for Irish
author: John Snow Labs
name: pos_ud_idt
date: 2020-07-29 23:34:00 +0800
task: Part of Speech Tagging
language: ga
edition: Spark NLP 2.5.5
spark_version: 2.4
tags: [pos, ga]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_idt_ga_2.5.5_2.4_1596054150271.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_idt_ga_2.5.5_2.4_1596054150271.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
pos = PerceptronModel.pretrained("pos_ud_idt", "ga") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Seachas a bheith ina rí ar an tuaisceart, is dochtúir Sasanach é John Snow agus ceannaire i bhforbairt ainéistéise agus sláinteachas míochaine.")
```
```scala
...
val pos = PerceptronModel.pretrained("pos_ud_idt", "ga")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("Seachas a bheith ina rí ar an tuaisceart, is dochtúir Sasanach é John Snow agus ceannaire i bhforbairt ainéistéise agus sláinteachas míochaine.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Seachas a bheith ina rí ar an tuaisceart, is dochtúir Sasanach é John Snow agus ceannaire i bhforbairt ainéistéise agus sláinteachas míochaine."""]
pos_df = nlu.load('ga.pos').predict(text, output_level='token')
pos_df
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='pos', begin=0, end=6, result='ADP', metadata={'word': 'Seachas'}),
Row(annotatorType='pos', begin=8, end=8, result='PART', metadata={'word': 'a'}),
Row(annotatorType='pos', begin=10, end=15, result='NOUN', metadata={'word': 'bheith'}),
Row(annotatorType='pos', begin=17, end=19, result='ADP', metadata={'word': 'ina'}),
Row(annotatorType='pos', begin=21, end=22, result='NOUN', metadata={'word': 'rí'}),
...]
```
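The `fullAnnotate` rows above can be flattened into plain (word, tag) pairs with ordinary Python. The dicts below are hard-coded sample data mirroring the row fields shown, not live model output:

```python
# Sample rows mimicking the annotation output shown above (illustrative only).
rows = [
    {"result": "ADP",  "metadata": {"word": "Seachas"}},
    {"result": "PART", "metadata": {"word": "a"}},
    {"result": "NOUN", "metadata": {"word": "bheith"}},
]

def to_pairs(rows):
    """Pair each token with its predicted part-of-speech tag."""
    return [(r["metadata"]["word"], r["result"]) for r in rows]

print(to_pairs(rows))
```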
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_idt|
|Type:|pos|
|Compatibility:|Spark NLP 2.5.5+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[pos]|
|Language:|ga|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: Stopwords Remover for Bulgarian language (405 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, bg, open_source]
task: Stop Words Removal
language: bg
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_bg_3.4.1_3.0_1646672949029.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_bg_3.4.1_3.0_1646672949029.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","bg") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Не си по-добър от мен"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","bg")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("Не си по-добър от мен").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("bg.stopwords").predict("""Не си по-добър от мен""")
```
## Results
```bash
+--------------+
|result |
+--------------+
|[Не, по-добър]|
+--------------+
```
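The removal step itself is simple token filtering. This sketch uses a small illustrative subset of the Bulgarian stopwords-iso list (not the model's full 405-entry set) to reproduce the result above:

```python
# Illustrative subset of Bulgarian stopwords (the real model ships 405 entries).
stopwords = {"си", "от", "мен", "е", "и", "на"}

def clean_tokens(tokens, stopwords):
    """Drop tokens found in the stopword set (lowercased comparison)."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["Не", "си", "по-добър", "от", "мен"]
print(clean_tokens(tokens, stopwords))
```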
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|bg|
|Size:|3.0 KB|
---
layout: model
title: Pipeline to Detect clinical events (biobert)
author: John Snow Labs
name: ner_events_biobert_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_events_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_events_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_biobert_pipeline_en_3.4.1_3.0_1647873577802.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_biobert_pipeline_en_3.4.1_3.0_1647873577802.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_events_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("The patient presented to the emergency room last evening")
```
```scala
val pipeline = new PretrainedPipeline("ner_events_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("The patient presented to the emergency room last evening")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.biobert_events.pipeline").predict("""The patient presented to the emergency room last evening""")
```
## Results
```bash
+------------------+-------------+
|chunks |entities |
+------------------+-------------+
|presented |OCCURRENCE |
|the emergency room|CLINICAL_DEPT|
+------------------+-------------+
```
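The chunk column above is produced by the pipeline's NerConverter stage, which merges token-level IOB tags into entity chunks. A minimal sketch of that merging logic, with illustrative tags chosen to reproduce the two chunks shown:

```python
# Illustrative token-level IOB tags (not live model output).
tokens = ["The", "patient", "presented", "to", "the", "emergency", "room",
          "last", "evening"]
tags = ["O", "O", "B-OCCURRENCE", "O", "B-CLINICAL_DEPT",
        "I-CLINICAL_DEPT", "I-CLINICAL_DEPT", "O", "O"]

def iob_to_chunks(tokens, tags):
    """Collect consecutive B-/I- tokens of the same label into chunks."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(iob_to_chunks(tokens, tags))
```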
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_events_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.0 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverter
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from Shappey)
author: John Snow Labs
name: roberta_qa_base_qna_squad2_trained
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-QnA-squad2-trained` is an English model originally trained by `Shappey`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_qna_squad2_trained_en_4.3.0_3.0_1674212572106.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_qna_squad2_trained_en_4.3.0_3.0_1674212572106.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_qna_squad2_trained","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_qna_squad2_trained","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
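Extractive QA models like this one return a span of the context rather than generated text. The final step is just slicing the context at the predicted boundaries; the offsets below are hard-coded stand-ins for the model's prediction, for illustration only:

```python
context = "My name is Clara and I live in Berkeley."
start, end = 11, 16  # illustrative character span standing in for the model's prediction

answer = context[start:end]
print(answer)
```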
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_qna_squad2_trained|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|456.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Shappey/roberta-base-QnA-squad2-trained
---
layout: model
title: English T5ForConditionalGeneration Cased model (from google)
author: John Snow Labs
name: t5_efficient_xl_nl2
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-xl-nl2` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_xl_nl2_en_4.3.0_3.0_1675124205513.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_xl_nl2_en_4.3.0_3.0_1675124205513.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_xl_nl2","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_xl_nl2","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_xl_nl2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|575.7 MB|
## References
- https://huggingface.co/google/t5-efficient-xl-nl2
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Hindi Bert Embeddings (from monsoon-nlp)
author: John Snow Labs
name: bert_embeddings_muril_adapted_local
date: 2022-04-11
tags: [bert, embeddings, hi, open_source]
task: Embeddings
language: hi
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `muril-adapted-local` is a Hindi model originally trained by `monsoon-nlp`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_hi_3.4.2_3.0_1649673217108.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_hi_3.4.2_3.0_1649673217108.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","hi") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["मुझे स्पार्क एनएलपी पसंद है"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","hi")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("मुझे स्पार्क एनएलपी पसंद है").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("hi.embed.muril_adapted_local").predict("""मुझे स्पार्क एनएलपी पसंद है""")
```
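The `embeddings` column holds one dense vector per token; downstream tasks commonly compare such vectors with cosine similarity. The toy 4-dimensional vectors below stand in for the model's real high-dimensional outputs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

v1 = [0.2, 0.1, 0.4, 0.3]
v2 = [0.2, 0.1, 0.4, 0.3]
v3 = [-0.4, 0.3, -0.1, 0.2]

print(cosine(v1, v2))  # identical vectors -> 1.0
print(cosine(v1, v3))  # dissimilar vectors score lower
```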
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_muril_adapted_local|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|hi|
|Size:|888.7 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/monsoon-nlp/muril-adapted-local
- https://tfhub.dev/google/MuRIL/1
---
layout: model
title: Fast Neural Machine Translation Model from Romance Languages to English
author: John Snow Labs
name: opus_mt_roa_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, roa, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `roa`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_roa_en_xx_2.7.0_2.4_1609166767520.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_roa_en_xx_2.7.0_2.4_1609166767520.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")\
marian = MarianTransformer.pretrained("opus_mt_roa_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_roa_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.roa.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_roa_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Multilingual XLMRobertaForTokenClassification Base Cased model (from Neha2608)
author: John Snow Labs
name: xlmroberta_ner_neha2608_base_finetuned_panx_all
date: 2022-08-13
tags: [xx, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: xx
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `Neha2608`.
## Predicted Entities
`ORG`, `LOC`, `PER`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_neha2608_base_finetuned_panx_all_xx_4.1.0_3.0_1660427653138.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_neha2608_base_finetuned_panx_all_xx_4.1.0_3.0_1660427653138.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_neha2608_base_finetuned_panx_all","xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_neha2608_base_finetuned_panx_all","xx")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_neha2608_base_finetuned_panx_all|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|xx|
|Size:|861.7 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Neha2608/xlm-roberta-base-finetuned-panx-all
---
layout: model
title: German asr_wav2vec2_large_xlsr_53_German TFWav2Vec2ForCTC from MehdiHosseiniMoghadam
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_53_German
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_German` is a German model originally trained by MehdiHosseiniMoghadam.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device please use pipeline_asr_wav2vec2_large_xlsr_53_German_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_German_de_4.2.0_3.0_1664107471966.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_German_de_4.2.0_3.0_1664107471966.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_German', lang = 'de')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_German", lang = "de")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_German|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|de|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English ElectraForQuestionAnswering model (from howey)
author: John Snow Labs
name: electra_qa_large_squad
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-large-squad` is an English model originally trained by `howey`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_large_squad_en_4.0.0_3.0_1655920994017.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_large_squad_en_4.0.0_3.0_1655920994017.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_large_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_large_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.electra.large").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_robot22", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_robot22", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_robot22|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|322.3 MB|
---
layout: model
title: German T5ForConditionalGeneration Cased model (from diversifix)
author: John Snow Labs
name: t5_diversiformer
date: 2023-01-30
tags: [de, open_source, t5, tensorflow]
task: Text Generation
language: de
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `diversiformer` is a German model originally trained by `diversifix`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_diversiformer_de_4.3.0_3.0_1675100976411.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_diversiformer_de_4.3.0_3.0_1675100976411.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_diversiformer","de") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_diversiformer","de")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_diversiformer|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|de|
|Size:|1.2 GB|
## References
- https://huggingface.co/diversifix/diversiformer
- https://arxiv.org/abs/2010.11934
- https://github.com/diversifix/diversiformer
- https://www.gnu.org/licenses/
---
layout: model
title: Spanish RobertaForQuestionAnswering (from hackathon-pln-es)
author: John Snow Labs
name: roberta_qa_roberta_base_biomedical_es_squad2_hackathon_pln
date: 2022-06-21
tags: [es, open_source, question_answering, roberta]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-biomedical-es-squad2-es` is a Spanish model originally trained by `hackathon-pln-es`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_biomedical_es_squad2_hackathon_pln_es_4.0.0_3.0_1655790263349.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_biomedical_es_squad2_hackathon_pln_es_4.0.0_3.0_1655790263349.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_biomedical_es_squad2_hackathon_pln","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_biomedical_es_squad2_hackathon_pln","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.squadv2_bio_medical.roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_biomedical_es_squad2_hackathon_pln|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|465.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/hackathon-pln-es/roberta-base-biomedical-es-squad2-es
- https://somosnlp.org/hackathon
---
layout: model
title: Pipeline to Extract Oncology Tests
author: John Snow Labs
name: ner_oncology_test_pipeline
date: 2023-03-09
tags: [licensed, clinical, oncology, en, ner, test]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_oncology_test](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_test_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_test_pipeline_en_4.3.0_3.2_1678351357734.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_test_pipeline_en_4.3.0_3.2_1678351357734.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_oncology_test_pipeline", "en", "clinical/models")
text = ''' biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_oncology_test_pipeline", "en", "clinical/models")
val text = " biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:--------------------------|--------:|------:|:---------------|-------------:|
| 0 | biopsy | 1 | 6 | Pathology_Test | 0.9987 |
| 1 | ultrasound guided | 31 | 47 | Imaging_Test | 0.87635 |
| 2 | chest computed tomography | 67 | 91 | Imaging_Test | 0.9176 |
| 3 | CT | 94 | 95 | Imaging_Test | 0.8294 |
```
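The `begin` and `end` columns are inclusive character offsets into the input text. They can be sanity-checked in plain Python, independent of Spark NLP (the `offsets` helper below is illustrative, not part of the library):

```python
# The exact input text passed to the pipeline above (note the leading space).
text = " biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative."

def offsets(chunk: str):
    """Return the (begin, end) inclusive character offsets of `chunk` in `text`."""
    begin = text.find(chunk)
    return begin, begin + len(chunk) - 1

# These match the begin/end columns reported in the Results table.
assert offsets("biopsy") == (1, 6)
assert offsets("ultrasound guided") == (31, 47)
assert offsets("chest computed tomography") == (67, 91)
assert offsets("CT") == (94, 95)
```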
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_test_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Translate English to Sino-Tibetan languages Pipeline
author: John Snow Labs
name: translate_en_sit
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, sit, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `sit`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_sit_xx_2.7.0_2.4_1609691674420.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_sit_xx_2.7.0_2.4_1609691674420.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_sit", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_sit", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.sit').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_sit|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Yoruba Named Entity Recognition (from mbeukman)
author: John Snow Labs
name: xlmroberta_ner_xlm_roberta_base_finetuned_yoruba_finetuned_ner_yoruba
date: 2022-05-17
tags: [xlm_roberta, ner, token_classification, yo, open_source]
task: Named Entity Recognition
language: yo
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-yoruba-finetuned-ner-yoruba` is a Yoruba model originally trained by `mbeukman`.
## Predicted Entities
`PER`, `ORG`, `LOC`, `DATE`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_yoruba_finetuned_ner_yoruba_yo_3.4.2_3.0_1652808837178.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_yoruba_finetuned_ner_yoruba_yo_3.4.2_3.0_1652808837178.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_yoruba_finetuned_ner_yoruba","yo") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Mo nifẹ Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_yoruba_finetuned_ner_yoruba","yo")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Mo nifẹ Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_yoruba_finetuned_ner_yoruba|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|yo|
|Size:|1.0 GB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-yoruba-finetuned-ner-yoruba
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://www.apache.org/licenses/LICENSE-2.0
- https://github.com/Michael-Beukman
---
layout: model
title: Word2Vec Embeddings in Malagasy (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, mg, open_source]
task: Embeddings
language: mg
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
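Conceptually, a lookup annotator is a token-to-vector dictionary: known tokens map to their stored vector and out-of-vocabulary tokens map to the zero vector. A toy sketch of that idea with made-up 3-dimensional vectors (the real `w2v_cc_300d` lookup uses 300 dimensions):

```python
import math

# Hypothetical 3-d vectors for illustration only; the real model is 300-d.
vectors = {
    "tiako": [0.9, 0.1, 0.0],
    "spark": [0.2, 0.8, 0.1],
    "nlp":   [0.1, 0.9, 0.2],
}

def embed(token, dim=3):
    # Case-insensitive lookup (the card lists "Case sensitive: false");
    # out-of-vocabulary tokens map to the zero vector.
    return vectors.get(token.lower(), [0.0] * dim)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# "spark" and "nlp" have similar toy vectors; "ny" is out of vocabulary.
assert cosine(embed("spark"), embed("nlp")) > cosine(embed("spark"), embed("tiako"))
assert embed("ny") == [0.0, 0.0, 0.0]
```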
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mg_3.4.1_3.0_1647444052051.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mg_3.4.1_3.0_1647444052051.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mg") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Tiako ny spark nlp"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mg")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Tiako ny spark nlp").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("mg.embed.w2v_cc_300d").predict("""Tiako ny spark nlp""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|mg|
|Size:|233.5 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Pipeline to Detect Cancer Genetics
author: John Snow Labs
name: ner_bionlp_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_bionlp](https://nlp.johnsnowlabs.com/2021/03/31/ner_bionlp_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_pipeline_en_3.4.1_3.0_1647871349979.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_pipeline_en_3.4.1_3.0_1647871349979.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_bionlp_pipeline", "en", "clinical/models")
pipeline.annotate("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes.")
```
```scala
val pipeline = new PretrainedPipeline("ner_bionlp_pipeline", "en", "clinical/models")
pipeline.annotate("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.bionlp.pipeline").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes.""")
```
## Results
```bash
+----------------------+--------------------+
|chunk |ner_label |
+----------------------+--------------------+
|human |Organism |
|Kir 3.3 |Gene_or_gene_product|
|GIRK3 |Gene_or_gene_product|
|potassium |Simple_chemical |
|GIRK |Gene_or_gene_product|
|chromosome 1q21-23 |Cellular_component |
|pancreas |Organ |
|tissues |Tissue |
|fat and skeletal muscle|Tissue             |
|KCNJ9 |Gene_or_gene_product|
|Type II |Gene_or_gene_product|
+----------------------+--------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_bionlp_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: Detect Persons, Locations, Organizations and Misc Entities in Polish (WikiNER 6B 100)
author: John Snow Labs
name: wikiner_6B_100
date: 2020-05-10
task: Named Entity Recognition
language: pl
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [ner, pl, open_source]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 6B 100 is trained with GloVe 6B 100 word embeddings, so be sure to use the same embeddings in the pipeline.
{:.h2_title}
## Predicted Entities
Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`.
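Under the hood, the model tags each token with a BIO label (`B-PER`, `I-PER`, `O`, ...) and a converter merges `B-`/`I-` runs into entity chunks. An illustrative sketch of that merging step, using hypothetical tags rather than this model's actual output:

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continue the current entity
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["William", "Gates", "founded", "Microsoft", "in", "Albuquerque"]
tags   = ["B-PER",   "I-PER", "O",       "B-ORG",     "O",  "B-LOC"]
assert bio_to_chunks(tokens, tags) == [
    ("William Gates", "PER"), ("Microsoft", "ORG"), ("Albuquerque", "LOC")]
```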
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_PL){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_PL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_100_pl_2.5.0_2.4_1588519719293.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_100_pl_2.5.0_2.4_1588519719293.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
ner_model = NerDLModel.pretrained("wikiner_6B_100", "pl") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (ur. 28 października 1955 r.) To amerykański magnat biznesowy, programista, inwestor i filantrop. Najbardziej znany jest jako współzałożyciel Microsoft Corporation. Podczas swojej kariery w Microsoft Gates zajmował stanowiska prezesa, dyrektora generalnego (CEO), prezesa i głównego architekta oprogramowania, będąc jednocześnie największym indywidualnym akcjonariuszem do maja 2014 r. Jest jednym z najbardziej znanych przedsiębiorców i pionierów rewolucja mikrokomputerowa lat 70. i 80. Urodzony i wychowany w Seattle w stanie Waszyngton, Gates był współzałożycielem Microsoftu z przyjacielem z dzieciństwa Paulem Allenem w 1975 r. W Albuquerque w Nowym Meksyku; stała się największą na świecie firmą produkującą oprogramowanie komputerowe. Gates prowadził firmę jako prezes i dyrektor generalny, aż do ustąpienia ze stanowiska dyrektora generalnego w styczniu 2000 r., Ale pozostał przewodniczącym i został głównym architektem oprogramowania. Pod koniec lat 90. Gates był krytykowany za taktykę biznesową, którą uważano za antykonkurencyjną. Opinię tę podtrzymują liczne orzeczenia sądowe. W czerwcu 2006 r. Gates ogłosił, że przejdzie do pracy w niepełnym wymiarze godzin w Microsoft i pracy w pełnym wymiarze godzin w Bill & Melinda Gates Foundation, prywatnej fundacji charytatywnej, którą on i jego żona Melinda Gates utworzyli w 2000 r. Stopniowo przeniósł obowiązki na Raya Ozziego i Craiga Mundie. Zrezygnował z funkcji prezesa Microsoftu w lutym 2014 r. I objął nowe stanowisko jako doradca ds. Technologii, aby wesprzeć nowo mianowaną CEO Satyę Nadellę.']], ["text"]))
```
```scala
...
val embeddings = WordEmbeddingsModel.pretrained("glove_100d")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("wikiner_6B_100", "pl")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("William Henry Gates III (ur. 28 października 1955 r.) To amerykański magnat biznesowy, programista, inwestor i filantrop. Najbardziej znany jest jako współzałożyciel Microsoft Corporation. Podczas swojej kariery w Microsoft Gates zajmował stanowiska prezesa, dyrektora generalnego (CEO), prezesa i głównego architekta oprogramowania, będąc jednocześnie największym indywidualnym akcjonariuszem do maja 2014 r. Jest jednym z najbardziej znanych przedsiębiorców i pionierów rewolucja mikrokomputerowa lat 70. i 80. Urodzony i wychowany w Seattle w stanie Waszyngton, Gates był współzałożycielem Microsoftu z przyjacielem z dzieciństwa Paulem Allenem w 1975 r. W Albuquerque w Nowym Meksyku; stała się największą na świecie firmą produkującą oprogramowanie komputerowe. Gates prowadził firmę jako prezes i dyrektor generalny, aż do ustąpienia ze stanowiska dyrektora generalnego w styczniu 2000 r., Ale pozostał przewodniczącym i został głównym architektem oprogramowania. Pod koniec lat 90. Gates był krytykowany za taktykę biznesową, którą uważano za antykonkurencyjną. Opinię tę podtrzymują liczne orzeczenia sądowe. W czerwcu 2006 r. Gates ogłosił, że przejdzie do pracy w niepełnym wymiarze godzin w Microsoft i pracy w pełnym wymiarze godzin w Bill & Melinda Gates Foundation, prywatnej fundacji charytatywnej, którą on i jego żona Melinda Gates utworzyli w 2000 r. Stopniowo przeniósł obowiązki na Raya Ozziego i Craiga Mundie. Zrezygnował z funkcji prezesa Microsoftu w lutym 2014 r. I objął nowe stanowisko jako doradca ds. Technologii, aby wesprzeć nowo mianowaną CEO Satyę Nadellę.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""William Henry Gates III (ur. 28 października 1955 r.) To amerykański magnat biznesowy, programista, inwestor i filantrop. Najbardziej znany jest jako współzałożyciel Microsoft Corporation. Podczas swojej kariery w Microsoft Gates zajmował stanowiska prezesa, dyrektora generalnego (CEO), prezesa i głównego architekta oprogramowania, będąc jednocześnie największym indywidualnym akcjonariuszem do maja 2014 r. Jest jednym z najbardziej znanych przedsiębiorców i pionierów rewolucja mikrokomputerowa lat 70. i 80. Urodzony i wychowany w Seattle w stanie Waszyngton, Gates był współzałożycielem Microsoftu z przyjacielem z dzieciństwa Paulem Allenem w 1975 r. W Albuquerque w Nowym Meksyku; stała się największą na świecie firmą produkującą oprogramowanie komputerowe. Gates prowadził firmę jako prezes i dyrektor generalny, aż do ustąpienia ze stanowiska dyrektora generalnego w styczniu 2000 r., Ale pozostał przewodniczącym i został głównym architektem oprogramowania. Pod koniec lat 90. Gates był krytykowany za taktykę biznesową, którą uważano za antykonkurencyjną. Opinię tę podtrzymują liczne orzeczenia sądowe. W czerwcu 2006 r. Gates ogłosił, że przejdzie do pracy w niepełnym wymiarze godzin w Microsoft i pracy w pełnym wymiarze godzin w Bill & Melinda Gates Foundation, prywatnej fundacji charytatywnej, którą on i jego żona Melinda Gates utworzyli w 2000 r. Stopniowo przeniósł obowiązki na Raya Ozziego i Craiga Mundie. Zrezygnował z funkcji prezesa Microsoftu w lutym 2014 r. I objął nowe stanowisko jako doradca ds. Technologii, aby wesprzeć nowo mianowaną CEO Satyę Nadellę."""]
ner_df = nlu.load('pl.ner.wikiner.glove.6B_100').predict(text, output_level = "chunk")
ner_df[["entities", "entities_confidence"]]
```
{:.h2_title}
## Results
```bash
+-------------------------------+---------+
|chunk |ner_label|
+-------------------------------+---------+
|William Henry Gates III |PER |
|Microsoft Corporation |ORG |
|Podczas swojej kariery |MISC |
|Microsoft Gates |MISC |
|CEO |ORG |
|Urodzony |LOC |
|Seattle |LOC |
|Waszyngton |LOC |
|Gates |PER |
|Microsoftu |ORG |
|Paulem Allenem |PER |
|Albuquerque |LOC |
|Nowym Meksyku |LOC |
|Gates |PER |
|Ale |PER |
|Gates |PER |
|Opinię |PER |
|Gates |PER |
|Microsoft |ORG |
|Bill & Melinda Gates Foundation|ORG |
+-------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|wikiner_6B_100|
|Type:|ner|
|Compatibility:| Spark NLP 2.5.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|pl|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model was trained on data from [https://pl.wikipedia.org](https://pl.wikipedia.org)
---
layout: model
title: Fast Neural Machine Translation Model from Shona to English
author: John Snow Labs
name: opus_mt_sn_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, sn, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `sn`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_sn_en_xx_2.7.0_2.4_1609167066503.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_sn_en_xx_2.7.0_2.4_1609167066503.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_sn_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_sn_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.sn.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_sn_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Spanish RobertaForQuestionAnswering Base Cased model (from IIC)
author: John Snow Labs
name: roberta_qa_base_spanish_s_c
date: 2022-12-02
tags: [es, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: es
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-sqac` is a Spanish model originally trained by `IIC`.
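An extractive QA head scores every token as a potential answer start and end; the prediction is the highest-scoring span with start before or at end. A minimal sketch of that span-selection step with toy scores (not this model's actual logits):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) token pair maximizing start+end score, start <= end."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best, best_score = (i, j), s + end_scores[j]
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start  = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]
end    = [0.0, 0.1, 0.0, 4.5, 0.2, 0.0, 0.0, 0.1, 2.0, 0.0]
i, j = best_span(start, end)
assert " ".join(tokens[i:j + 1]) == "Clara"
```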
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_s_c_es_4.2.4_3.0_1669986419016.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_s_c_es_4.2.4_3.0_1669986419016.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_s_c","es")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_s_c","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_spanish_s_c|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|es|
|Size:|460.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/IIC/roberta-base-spanish-sqac
- https://www.bsc.es/
- https://arxiv.org/abs/2107.07253
- https://paperswithcode.com/sota?task=question-answering&dataset=PlanTL-GOB-ES%2FSQAC
---
layout: model
title: Text Detection
author: John Snow Labs
name: text_detection_v1
date: 2021-12-09
tags: [en, licensed]
task: OCR Text Detection & Recognition
language: en
nav_key: models
edition: Visual NLP 3.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
CRAFT (Character-Region Awareness For Text detection) is designed around a convolutional neural network that produces a character region score and an affinity score. The region score is used to localize individual characters in the image, and the affinity score is used to group characters into a single text instance. To compensate for the lack of character-level annotations, the authors propose a weakly-supervised learning framework that estimates character-level ground truths in existing real word-level datasets.
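The grouping step can be pictured as linking adjacent character detections whose pairwise affinity score exceeds a threshold; each connected run becomes one text instance. A toy sketch of that idea (made-up scores, not the actual CRAFT post-processing):

```python
def group_characters(chars, affinities, threshold=0.5):
    """Link consecutive characters whose pairwise affinity exceeds the
    threshold; each connected run becomes one text instance."""
    words, current = [], [chars[0]]
    for ch, aff in zip(chars[1:], affinities):
        if aff > threshold:
            current.append(ch)      # high affinity: same word
        else:
            words.append("".join(current))
            current = [ch]          # low affinity: start a new word
    words.append("".join(current))
    return words

chars = list("textdetection")
# Affinity between each pair of consecutive characters; the single low
# score separates "text" from "detection".
affs = [0.9, 0.8, 0.9, 0.1, 0.9, 0.9, 0.8, 0.9, 0.9, 0.8, 0.9, 0.9]
assert group_characters(chars, affs) == ["text", "detection"]
```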
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/text_detection_v1_en_3.0.0_3.0_1639033905025.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/text_detection_v1_en_3.0.0_3.0_1639033905025.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|text_detection_v1|
|Type:|ocr|
|Compatibility:|Visual NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Output Labels:|[text_regions]|
|Language:|en|
---
layout: model
title: English RobertaForSequenceClassification Cased model (from mrm8488)
author: John Snow Labs
name: roberta_sequence_classifier_distilroberta_finetuned_financial_news_sentiment_analysis
date: 2022-07-13
tags: [en, open_source, roberta, sequence_classification]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-finetuned-financial-news-sentiment-analysis` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_distilroberta_finetuned_financial_news_sentiment_analysis_en_4.0.0_3.0_1657716075006.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_distilroberta_finetuned_financial_news_sentiment_analysis_en_4.0.0_3.0_1657716075006.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_distilroberta_finetuned_financial_news_sentiment_analysis","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_distilroberta_finetuned_financial_news_sentiment_analysis","en")
.setInputCols(Array("document", "token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_sequence_classifier_distilroberta_finetuned_financial_news_sentiment_analysis|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|309.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from juanmarmol)
author: John Snow Labs
name: distilbert_qa_juanmarmol_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `juanmarmol`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_juanmarmol_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771544745.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_juanmarmol_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771544745.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_juanmarmol_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_juanmarmol_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_juanmarmol_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/juanmarmol/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Universal Sentence Encoder XLING English and French
author: John Snow Labs
name: tfhub_use_xling_en_fr
date: 2020-12-08
task: Embeddings
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
deprecated: true
tags: [open_source, embeddings, xx]
supported: true
annotator: UniversalSentenceEncoder
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The Universal Sentence Encoder Cross-lingual (XLING) module is an extension of the Universal Sentence Encoder that includes training on multiple tasks across languages. The multi-task training setup is based on the paper "Learning Cross-lingual Sentence Representations via a Multi-task Dual Encoder".
This specific module is trained on English and French (en-fr) tasks, and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs.
It is trained on a variety of data sources and tasks, with the goal of learning text representations that are useful out-of-the-box for a number of applications. The input to the module is variable length English or French text and the output is a 512 dimensional vector.
We note that one does not need to specify the language that the input is in, as the model was trained such that English and French text with similar meanings will have similar (high dot product score) embeddings. We also note that this model can be used for monolingual English (and potentially monolingual French) tasks with comparable or even better performance than the purely English Universal Sentence Encoder.
Note: This model only works on Linux and macOS operating systems and is not compatible with Windows due to the incompatibility of the SentencePiece library.
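Since cross-lingual similarity is measured by the dot product (or cosine similarity) of the embeddings, downstream comparison can be as simple as the following sketch (the short vectors here are made-up stand-ins for the 512-dimensional `sentence_embeddings` output of the annotator):

```python
import numpy as np

# Stand-in embeddings; in practice these come from the "sentence_embeddings"
# column produced by the pipeline (512 dimensions each).
emb_en = np.array([0.1, 0.7, 0.2])  # e.g. "I love NLP"
emb_fr = np.array([0.2, 0.6, 0.1])  # e.g. "J'adore utiliser SparkNLP"

# Cosine similarity: high values indicate similar meaning across languages.
similarity = np.dot(emb_en, emb_fr) / (np.linalg.norm(emb_en) * np.linalg.norm(emb_fr))
print(round(similarity, 3))
```

This is why no language flag is needed at inference time: semantically close English and French sentences land near each other in the shared embedding space.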
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/tfhub_use_xling_en_fr_xx_2.7.0_2.4_1607440713842.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/tfhub_use_xling_en_fr_xx_2.7.0_2.4_1607440713842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_xling_en_fr", "xx") \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I love NLP"], ["J'adore utiliser SparkNLP"]], ["text"]))
```
```scala
...
val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_xling_en_fr", "xx")
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I love NLP", "J'adore utiliser SparkNLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP", "J'adore utiliser SparkNLP"]
embeddings_df = nlu.load('xx.use.xling_en_fr').predict(text, output_level='sentence')
embeddings_df
```
## Results
It gives a 512-dimensional vector of the sentences.
```bash
sentence xx_use_xling_en_fr_embeddings
0 I love NLP [0.0608731247484684, -0.06734627485275269, -0....
1 J'adore utiliser SparkNLP [0.07564588636159897, -0.06953935325145721, 0....
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|tfhub_use_xling_en_fr|
|Compatibility:|Spark NLP 2.7.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|xx|
## Data Source
This embeddings model is imported from [https://tfhub.dev/google/universal-sentence-encoder-xling/en-fr/1](https://tfhub.dev/google/universal-sentence-encoder-xling/en-fr/1)
---
layout: model
title: Lithuanian BertForMaskedLM Base Cased model (from Geotrend)
author: John Snow Labs
name: bert_embeddings_base_lt_cased
date: 2022-12-02
tags: [lt, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: lt
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-lt-cased` is a Lithuanian model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_lt_cased_lt_4.2.4_3.0_1670018323172.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_lt_cased_lt_4.2.4_3.0_1670018323172.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_lt_cased","lt") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_lt_cased","lt")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_lt_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|lt|
|Size:|369.3 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/bert-base-lt-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: English T5ForConditionalGeneration Cased model (from Apoorva)
author: John Snow Labs
name: t5_apoorva_k2t_test
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `k2t-test` is an English model originally trained by `Apoorva`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_apoorva_k2t_test_en_4.3.0_3.0_1675103912014.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_apoorva_k2t_test_en_4.3.0_3.0_1675103912014.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_apoorva_k2t_test","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_apoorva_k2t_test","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_apoorva_k2t_test|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|275.8 MB|
## References
- https://huggingface.co/Apoorva/k2t-test
---
layout: model
title: Lemmatizer (Norwegian Bokmål, SpacyLookup)
author: John Snow Labs
name: lemma_spacylookup
date: 2022-03-08
tags: [open_source, lemmatizer, nb]
task: Lemmatization
language: nb
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Norwegian Bokmål lemmatizer is a scalable, production-ready version of the rule-based lemmatizer available in the [spaCy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_nb_3.4.1_3.0_1646753600988.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_nb_3.4.1_3.0_1646753600988.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","nb") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer])
example = spark.createDataFrame([["Du er ikke bedre enn meg"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","nb")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer))
val data = Seq("Du er ikke bedre enn meg").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("nb.lemma").predict("""Du er ikke bedre enn meg""")
```
## Results
```bash
+-------------------------------+
|result |
+-------------------------------+
|[Du, er, ikke, bedre, enn, jeg]|
+-------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma_spacylookup|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|nb|
|Size:|15.3 KB|
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from 123tarunanand)
author: John Snow Labs
name: roberta_qa_base_finetuned
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned` is an English model originally trained by `123tarunanand`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_en_4.3.0_3.0_1674216346492.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_en_4.3.0_3.0_1674216346492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_finetuned|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/123tarunanand/roberta-base-finetuned
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265903
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265903` is an English model originally trained by `teacookies`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265903_en_4.0.0_3.0_1655985000390.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265903_en_4.0.0_3.0_1655985000390.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265903","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265903","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265903").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265903|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|888.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265903
---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_bert_base_uncased_few_shot_k_128_finetuned_squad_seed_0
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-128-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_128_finetuned_squad_seed_0_en_4.0.0_3.0_1654180777930.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_128_finetuned_squad_seed_0_en_4.0.0_3.0_1654180777930.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_few_shot_k_128_finetuned_squad_seed_0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_uncased_few_shot_k_128_finetuned_squad_seed_0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base_uncased_128d_seed_0").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_uncased_few_shot_k_128_finetuned_squad_seed_0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-128-finetuned-squad-seed-0
---
layout: model
title: Legal Definition Of Confidential Information Clause Binary Classifier
author: John Snow Labs
name: legclf_def_of_conf_info_clause
date: 2023-02-13
tags: [en, legal, classification, definition, confidential, information, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `def_of_conf_info` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting them using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
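A minimal sketch of the paragraph-splitting approach mentioned above (splitting on blank lines with the standard library; the `split_paragraphs` helper name is ours, not part of the workshop code):

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("1. Definition of Confidential Information.\n\n"
       "Each party agrees to keep the other party's information secret.\n\n"
       "2. Term.")
paragraphs = split_paragraphs(doc)
print(len(paragraphs))
```

Each resulting paragraph can then be passed through the classifier independently, keeping every input well under the 512-token limit.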
## Predicted Entities
`def_of_conf_info`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_def_of_conf_info_clause_en_1.0.0_3.0_1676302657181.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_def_of_conf_info_clause_en_1.0.0_3.0_1676302657181.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{:.h2_title}
## Results
```bash
+-----------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
| chunk| entity| target_text(rxnorm)| code|confidence|
+-----------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
| metformin|TREATMENT|Metformin hydrochloride 500 MG Oral Tablet [Glucamet]:::Metformin hydrochloride 850 MG Oral Table...| 105376| 0.2067|
| glipizide|TREATMENT|Glipizide 5 MG Oral Tablet [Minidiab]:::Glipizide 5 MG Oral Tablet [Glucotrol]:::Glipizide 5 MG O...| 105373| 0.2224|
| dapagliflozin for T2DM|TREATMENT|dapagliflozin 5 MG / saxagliptin 5 MG Oral Tablet [Qtern]:::dapagliflozin 10 MG / saxagliptin 5 M...|2169276| 0.2532|
| atorvastatin and gemfibrozil for HTG|TREATMENT|atorvastatin 20 MG / ezetimibe 10 MG Oral Tablet [Liptruzet]:::atorvastatin 40 MG / ezetimibe 10 ...|1422095| 0.2183|
| dapagliflozin|TREATMENT|dapagliflozin 5 MG Oral Tablet [Farxiga]:::dapagliflozin 10 MG Oral Tablet [Farxiga]:::dapagliflo...|1486981| 0.3523|
| bicarbonate|TREATMENT|Sodium Bicarbonate 0.417 MEQ/ML Oral Solution [Desempacho]:::potassium bicarbonate 25 MEQ Efferve...|1305099| 0.2149|
|insulin drip for euDKA and HTG with a reduction|TREATMENT|insulin aspart, human 30 UNT/ML / insulin degludec 70 UNT/ML Pen Injector [Ryzodeg]:::3 ML insuli...|1994318| 0.2124|
| SGLT2 inhibitor|TREATMENT|C1 esterase inhibitor (human) 500 UNT Injection [Cinryze]:::alpha 1-proteinase inhibitor, human 1...| 809871| 0.2044|
| insulin glargine|TREATMENT|Insulin Glargine 100 UNT/ML Pen Injector [Lantus]:::Insulin Glargine 300 UNT/ML Pen Injector [Tou...|1359856| 0.2265|
| insulin lispro with meals|TREATMENT|Insulin Lispro 100 UNT/ML Cartridge [Humalog]:::Insulin Lispro 200 UNT/ML Pen Injector [Humalog]:...|1652648| 0.2469|
| metformin|TREATMENT|Metformin hydrochloride 500 MG Oral Tablet [Glucamet]:::Metformin hydrochloride 850 MG Oral Table...| 105376| 0.2067|
| SGLT2 inhibitors|TREATMENT|alpha 1-proteinase inhibitor, human 1 MG Injection [Prolastin]:::C1 esterase inhibitor (human) 50...|1661220| 0.2167|
+-----------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Name:|chunkresolve_rxnorm_sbd_clinical|
|Type:|ChunkEntityResolverModel|
|Compatibility:|Spark NLP 2.5.1+|
|License:|Licensed|
|Edition:|Official|
|Input labels:|[token, chunk_embeddings]|
|Output labels:|[entity]|
|Language:|en|
|Case sensitive:|True|
|Dependencies:|embeddings_clinical|
{:.h2_title}
## Data Source
Trained on December 2019 RxNorm Clinical Drugs (TTY=SBD) ontology graph with `embeddings_clinical`
https://www.nlm.nih.gov/pubs/techbull/nd19/brief/nd19_rxnorm_december_2019_release.html
---
layout: model
title: English T5ForConditionalGeneration Cased model (from ThomasNLG)
author: John Snow Labs
name: t5_qg_squad1
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-qg_squad1-en` is an English model originally trained by `ThomasNLG`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_qg_squad1_en_4.3.0_3.0_1675125547851.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_qg_squad1_en_4.3.0_3.0_1675125547851.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_qg_squad1","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_qg_squad1","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_qg_squad1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|923.2 MB|
## References
- https://huggingface.co/ThomasNLG/t5-qg_squad1-en
- https://github.com/ThomasScialom/QuestEval
---
layout: model
title: Detect PHI for Deidentification (Generic)
author: John Snow Labs
name: ner_deid_generic
date: 2022-01-06
tags: [deid, ner, de, licensed]
task: Named Entity Recognition
language: de
edition: Healthcare NLP 3.3.4
spark_version: 2.4
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Named Entity Recognition annotator that allows a generic model to be trained by utilizing a deep learning algorithm (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER is a Named Entity Recognition model that annotates German text to find protected health information (PHI) that may need to be deidentified. It was trained with in-house annotations and detects 7 entities.
## Predicted Entities
`DATE`, `NAME`, `LOCATION`, `PROFESSION`, `AGE`, `ID`, `CONTACT`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_DE){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_de_3.3.4_2.4_1641460977185.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_de_3.3.4_2.4_1641460977185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","de","clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
deid_ner = MedicalNerModel.pretrained("ner_deid_generic", "de", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_deid_generic_chunk")
nlpPipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
deid_ner,
ner_converter])
data = spark.createDataFrame([["""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus
in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."""]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val deid_ner = MedicalNerModel.pretrained("ner_deid_generic", "de", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_deid_generic_chunk")
val nlpPipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
deid_ner,
ner_converter))
val data = Seq("""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""").toDS.toDF("text")
val result = nlpPipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.med_ner.deid_generic").predict("""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus
in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""")
```
## Results
```bash
+-------------------------+----------------------+
|chunk |ner_deid_generic_chunk|
+-------------------------+----------------------+
|Michael Berger |NAME |
|12 Dezember 2018 |DATE |
|St. Elisabeth-Krankenhaus|LOCATION |
|Bad Kissingen |LOCATION |
|Berger |NAME |
|76 |AGE |
+-------------------------+----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_generic|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|15.0 MB|
## Data Source
In-house annotated dataset
## Benchmarking
```bash
label tp fp fn total precision recall f1
CONTACT 68.0 25.0 12.0 80.0 0.7312 0.85 0.7861
NAME 3965.0 294.0 274.0 4239.0 0.931 0.9354 0.9332
DATE 4049.0 2.0 0.0 4049.0 0.9995 1.0 0.9998
ID 185.0 11.0 32.0 217.0 0.9439 0.8525 0.8959
LOCATION 5065.0 414.0 1021.0 6086.0 0.9244 0.8322 0.8759
PROFESSION 145.0 8.0 117.0 262.0 0.9477 0.5534 0.6988
AGE 458.0 13.0 18.0 476.0 0.9724 0.9622 0.9673
```
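As a quick sanity check, the precision, recall, and F1 columns above follow directly from the tp/fp/fn counts; for example, the CONTACT row can be reproduced as:

```python
# Derive the CONTACT row's metrics from its raw counts in the table above.
tp, fp, fn = 68.0, 25.0, 12.0

precision = tp / (tp + fp)                        # fraction of predictions that are correct
recall = tp / (tp + fn)                           # fraction of gold entities recovered
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 4))  # 0.7312
print(round(recall, 4))     # 0.85
print(round(f1, 4))         # 0.7861
```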
---
layout: model
title: Legal Waivers Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_waivers_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, waivers, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Waivers` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
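The 512-token constraint mentioned above can be handled with a simple whitespace-based pre-splitter (a rough sketch only: real transformer tokenizers count subword tokens, so the sketch leaves a safety margin; the function name and margin are illustrative):

```python
def split_into_chunks(text, max_tokens=512, margin=64):
    """Greedily group whitespace words into chunks that stay under the
    model's token budget, leaving a margin for subword expansion."""
    budget = max_tokens - margin
    chunks, current = [], []
    for word in text.split():
        if len(current) >= budget:
            chunks.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

# A 1000-word document splits into chunks of at most 448 words each.
doc = " ".join(f"word{i}" for i in range(1000))
chunks = split_into_chunks(doc)
print(len(chunks))                                  # 3
print(all(len(c.split()) <= 448 for c in chunks))   # True
```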
## Predicted Entities
`Waivers`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_waivers_bert_en_1.0.0_3.0_1678049911607.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_waivers_bert_en_1.0.0_3.0_1678049911607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
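{% include programmingLanguageSelectScalaPythonNLU.html %}
This card is missing its usage snippet; below is a minimal sketch following the pattern of other Legal NLP clause classifiers. The `sent_bert_base_cased` embeddings name and the `legal.ClassifierDLModel` class path are assumptions based on sibling model cards, so verify them against the model's actual dependencies before use.

```python
# Hypothetical pipeline sketch: embedding model and class names follow the
# convention of other legclf_*_bert cards and may differ for this model.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_waivers_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show()
```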
## Results
```bash
+---------+
|result   |
+---------+
|[Waivers]|
|[Other]  |
|[Other]  |
|[Waivers]|
+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_waivers_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 0.93 0.93 0.93 138
Waivers 0.92 0.92 0.92 106
accuracy - - 0.93 244
macro-avg 0.92 0.92 0.92 244
weighted-avg 0.93 0.93 0.93 244
```
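The weighted-avg row above can be reproduced from the per-class F1 scores and their supports:

```python
# Support-weighted average of the per-class F1 scores in the table above.
f1 = {"Other": 0.93, "Waivers": 0.92}
support = {"Other": 138, "Waivers": 106}
total = sum(support.values())  # 244

weighted_f1 = sum(f1[c] * support[c] for c in f1) / total
print(round(weighted_f1, 2))  # 0.93
```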
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from microsoft)
author: John Snow Labs
name: t5_ssr_base
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ssr-base` is an English model originally trained by `microsoft`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_ssr_base_en_4.3.0_3.0_1675107262685.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_ssr_base_en_4.3.0_3.0_1675107262685.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_ssr_base","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_ssr_base","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_ssr_base|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|926.9 MB|
## References
- https://huggingface.co/microsoft/ssr-base
- https://arxiv.org/abs/2101.00416
---
layout: model
title: News Classifier Pipeline for Turkish text
author: John Snow Labs
name: classifierdl_bert_news_pipeline
date: 2021-08-27
tags: [tr, news, classification, open_source]
task: Text Classification
language: tr
edition: Spark NLP 3.2.0
spark_version: 2.4
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline classifies Turkish news texts.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_TR_NEWS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_TR_NEWS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_bert_news_pipeline_tr_3.2.0_2.4_1630061137177.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_bert_news_pipeline_tr_3.2.0_2.4_1630061137177.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("classifierdl_bert_news_pipeline", lang = "tr")
result = pipeline.fullAnnotate("Bonservisi elinde olan Milli oyuncu, yeni takımıyla el sıkıştı")
result["class"]
```
```scala
val pipeline = new PretrainedPipeline("classifierdl_bert_news_pipeline", "tr")
val result = pipeline.fullAnnotate("Bonservisi elinde olan Milli oyuncu, yeni takımıyla el sıkıştı")(0)
```
## Results
```bash
["Sport"]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|classifierdl_bert_news_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|tr|
## Included Models
- DocumentAssembler
- BertSentenceEmbeddings
- ClassifierDLModel
---
layout: model
title: English Named Entity Recognition (from kSaluja)
author: John Snow Labs
name: bert_ner_autonlp_tele_new_5k_557515810
date: 2022-05-09
tags: [bert, ner, token_classification, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `autonlp-tele_new_5k-557515810` is an English model originally trained by `kSaluja`.
## Predicted Entities
`TARGET`, `SUGGESTIONTYPE`, `CALLTYPE`, `INSTRUMENT`, `BUYPRICE`, `HOLDINGPERIOD`, `STOPLOSS`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_autonlp_tele_new_5k_557515810_en_3.4.2_3.0_1652097492338.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_autonlp_tele_new_5k_557515810_en_3.4.2_3.0_1652097492338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_autonlp_tele_new_5k_557515810","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_autonlp_tele_new_5k_557515810","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_autonlp_tele_new_5k_557515810|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/kSaluja/autonlp-tele_new_5k-557515810
---
layout: model
title: Pipeline to Detect Clinical Entities (ner_eu_clinical_case)
author: John Snow Labs
name: ner_eu_clinical_case_pipeline
date: 2023-03-08
tags: [clinical, licensed, ner, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_eu_clinical_case](https://nlp.johnsnowlabs.com/2023/01/25/ner_eu_clinical_case_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_pipeline_en_4.3.0_3.2_1678262043022.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_pipeline_en_4.3.0_3.2_1678262043022.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_eu_clinical_case_pipeline", "en", "clinical/models")
text = """A 3-year-old boy with autistic disorder on hospital of pediatric ward A at university hospital. He has no family history of illness or autistic spectrum disorder. The child was diagnosed with a severe communication disorder, with social interaction difficulties and sensory processing delay. Blood work was normal (thyroid-stimulating hormone (TSH), hemoglobin, mean corpuscular volume (MCV), and ferritin). Upper endoscopy also showed a submucosal tumor causing subtotal obstruction of the gastric outlet. Because a gastrointestinal stromal tumor was suspected, distal gastrectomy was performed. Histopathological examination revealed spindle cell proliferation in the submucosal layer."""
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_eu_clinical_case_pipeline", "en", "clinical/models")
val text = """A 3-year-old boy with autistic disorder on hospital of pediatric ward A at university hospital. He has no family history of illness or autistic spectrum disorder. The child was diagnosed with a severe communication disorder, with social interaction difficulties and sensory processing delay. Blood work was normal (thyroid-stimulating hormone (TSH), hemoglobin, mean corpuscular volume (MCV), and ferritin). Upper endoscopy also showed a submucosal tumor causing subtotal obstruction of the gastric outlet. Because a gastrointestinal stromal tumor was suspected, distal gastrectomy was performed. Histopathological examination revealed spindle cell proliferation in the submucosal layer."""
val result = pipeline.fullAnnotate(text)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_attribute_correction_mlm_titles","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_attribute_correction_mlm_titles","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_attribute_correction_mlm_titles|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|466.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/ksabeh/roberta-base-attribute-correction-mlm-titles
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18 TFWav2Vec2ForCTC from jhonparra18
author: John Snow Labs
name: asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18` is an English model originally trained by jhonparra18.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18_en_4.2.0_3.0_1664019766050.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18_en_4.2.0_3.0_1664019766050.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_2_h_512
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-2_H-512` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_512_zh_4.2.4_3.0_1670325890317.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_512_zh_4.2.4_3.0_1670325890317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_512","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_512","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_2_h_512|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|66.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-2_H-512
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://cloud.tencent.com/
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken TFWav2Vec2ForCTC from cuzeverynameistaken
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken` is an English model originally trained by cuzeverynameistaken.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken_en_4.2.0_3.0_1664023121407.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken_en_4.2.0_3.0_1664023121407.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Part of Speech for English
author: John Snow Labs
name: pos_ud_ewt
date: 2021-03-08
tags: [part_of_speech, open_source, english, pos_ud_ewt, en]
task: Part of Speech Tagging
language: en
nav_key: models
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`.
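The perceptron update behind such a tagger is simple; here is a minimal, self-contained sketch (illustrative only, not Spark NLP's implementation; the "averaged" variant additionally keeps a running sum of the weights after every update and predicts with their mean, which reduces variance):

```python
from collections import defaultdict

# (feature, tag) -> weight; starts at zero for unseen pairs.
weights = defaultdict(float)
TAGS = ["NOUN", "VERB"]

def predict(features):
    """Score each tag as the sum of its feature weights; pick the best."""
    scores = {t: sum(weights[(f, t)] for f in features) for t in TAGS}
    return max(TAGS, key=lambda t: scores[t])

def update(features, gold):
    """On a mistake, reward the gold tag's features, penalize the guess's."""
    guess = predict(features)
    if guess != gold:
        for f in features:
            weights[(f, gold)] += 1.0
            weights[(f, guess)] -= 1.0

# Train on one example a few times; the tagger learns "run" after "to" is a verb.
feats = ["word=run", "prev=to"]
for _ in range(3):
    update(feats, "VERB")
print(predict(feats))  # VERB
```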
## Predicted Entities
- PROPN
- PUNCT
- ADJ
- NOUN
- VERB
- DET
- ADP
- AUX
- PRON
- PART
- SCONJ
- NUM
- ADV
- CCONJ
- X
- INTJ
- SYM
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_ewt_en_3.0.0_3.0_1615230175426.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_ewt_en_3.0.0_3.0_1615230175426.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ud_ewt", "en") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['Hello from John Snow Labs ! ']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_ewt", "en")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("Hello from John Snow Labs ! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs ! "]
token_df = nlu.load('en.pos.ud_ewt').predict(text)
token_df
```
## Results
```bash
token pos
0 Hello INTJ
1 from ADP
2 John PROPN
3 Snow PROPN
4 Labs PROPN
5 ! PUNCT
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_ewt|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|en|
---
layout: model
title: Sentence Entity Resolver for Billable ICD10-CM HCC Codes (sbertresolve_icd10cm_augmented_billable_hcc)
author: John Snow Labs
name: sbertresolve_icd10cm_augmented_billable_hcc
date: 2023-05-31
tags: [en, clinical, entity_resolution, icd10cm, billable, hcc, licensed]
task: Entity Resolution
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities and concepts to ICD-10-CM codes using `sbert_jsl_medium_uncased` sentence BERT embeddings. It supports 7-digit billable codes with Hierarchical Condition Category (HCC) status and returns the official resolution text within brackets inside the metadata. The model is augmented with synonyms, and the augmentations are adjusted according to cosine distance to the unnormalized terms (ground truths).
In the result, check the `all_k_aux_labels` field of the metadata to get the HCC status. This field can be split to get further details: billable status || HCC status || HCC score. For example, an `all_k_aux_labels` value of `1||1||19` means the billable status is 1, the HCC status is 1, and the HCC score is 19.
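The `all_k_aux_labels` value described above can be split with a small helper; a minimal sketch in plain Python (the helper name `parse_aux_label` is ours, not part of the library):

```python
def parse_aux_label(aux_label: str) -> dict:
    """Split an `all_k_aux_labels` entry (e.g. "1||1||19") into
    billable status, HCC status, and HCC score."""
    billable, hcc_status, hcc_score = aux_label.split("||")
    return {
        "billable": billable,
        "hcc_status": hcc_status,
        "hcc_score": int(hcc_score),
    }

print(parse_aux_label("1||1||19"))
```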
## Predicted Entities
`ICD-10-CM Codes`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_augmented_billable_hcc_en_4.4.2_3.0_1685534837223.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_augmented_billable_hcc_en_4.4.2_3.0_1685534837223.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(["PROBLEM"])
chunk2doc = Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
bert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("bert_embeddings")\
.setCaseSensitive(False)
icd10_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_icd10cm_augmented_billable_hcc", "en", "clinical/models")\
.setInputCols(["bert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
chunk2doc,
bert_embeddings,
icd10_resolver])
data_ner = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."]]).toDF("text")
results = nlpPipeline.fit(data_ner).transform(data_ner)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("PROBLEM"))
val chunk2doc = new Chunk2Doc()
.setInputCols("ner_chunk")
.setOutputCol("ner_chunk_doc")
val bert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")
.setInputCols("ner_chunk_doc")
.setOutputCol("bert_embeddings")
.setCaseSensitive(false)
val icd10_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_icd10cm_augmented_billable_hcc", "en", "clinical/models")
.setInputCols("bert_embeddings")
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, bert_embeddings, icd10_resolver))
val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_wechsel_chinese","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["我喜欢Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_wechsel_chinese","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("我喜欢Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.embed.roberta_base_wechsel_chinese").predict("""我喜欢Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_roberta_base_wechsel_chinese|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|468.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/benjamin/roberta-base-wechsel-chinese
- https://github.com/CPJKU/wechsel
- https://arxiv.org/abs/2112.06598
---
layout: model
title: English BertForQuestionAnswering Mini Uncased model (from Renukswamy)
author: John Snow Labs
name: bert_qa_renukswamy_minilm_uncased_squad2_finetuned_squad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `minilm-uncased-squad2-finetuned-squad` is an English model originally trained by `Renukswamy`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_renukswamy_minilm_uncased_squad2_finetuned_squad_en_4.0.0_3.0_1657190358667.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_renukswamy_minilm_uncased_squad2_finetuned_squad_en_4.0.0_3.0_1657190358667.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_renukswamy_minilm_uncased_squad2_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_renukswamy_minilm_uncased_squad2_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_renukswamy_minilm_uncased_squad2_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|124.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Renukswamy/minilm-uncased-squad2-finetuned-squad
---
layout: model
title: Pipeline to Resolve ICD-10-CM Codes
author: John Snow Labs
name: icd10cm_resolver_pipeline
date: 2022-11-02
tags: [en, clinical, licensed, resolver, chunk_mapping, pipeline, icd10cm]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Spark NLP for Healthcare 4.2.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline maps entities to their corresponding ICD-10-CM codes. Simply feed in your text and it will return the matching ICD-10-CM codes.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_resolver_pipeline_en_4.2.1_3.0_1667389014041.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_resolver_pipeline_en_4.2.1_3.0_1667389014041.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
resolver_pipeline = PretrainedPipeline("icd10cm_resolver_pipeline", "en", "clinical/models")
text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage"""
result = resolver_pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val resolver_pipeline = new PretrainedPipeline("icd10cm_resolver_pipeline", "en", "clinical/models")
val result = resolver_pipeline.fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.icd10cm_resolver.pipeline").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""")
```
## Results
```bash
+-----------------------------+---------+------------+
|chunk |ner_chunk|icd10cm_code|
+-----------------------------+---------+------------+
|gestational diabetes mellitus|PROBLEM |O24.919 |
|anisakiasis |PROBLEM |B81.0 |
|fetal and neonatal hemorrhage|PROBLEM |P545 |
+-----------------------------+---------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|icd10cm_resolver_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP for Healthcare 4.2.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|3.5 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- ChunkMapperModel
- ChunkMapperModel
- ChunkMapperFilterer
- Chunk2Doc
- BertSentenceEmbeddings
- SentenceEntityResolverModel
- ResolverMerger
---
layout: model
title: Extract entities in clinical trial abstracts (BertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_ner_clinical_trials_abstracts
date: 2022-06-29
tags: [berttokenclassifier, bert, biobert, en, ner, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
annotator: MedicalBertForTokenClassifier
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Named Entity Recognition model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP.
It extracts relevant entities from clinical trial abstracts. It uses a simplified version of the ontology specified by [Sanchez Graillet, O., et al.](https://pub.uni-bielefeld.de/record/2939477) in order to extract concepts related to trial design, diseases, drugs, population, statistics and publication.
## Predicted Entities
`Age`, `AllocationRatio`, `Author`, `BioAndMedicalUnit`, `CTAnalysisApproach`, `CTDesign`, `Confidence`, `Country`, `DisorderOrSyndrome`, `DoseValue`, `Drug`, `DrugTime`, `Duration`, `Journal`, `NumberPatients`, `PMID`, `PValue`, `PercentagePatients`, `PublicationYear`, `TimePoint`, `Value`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_trials_abstracts_en_3.5.3_3.0_1656475829985.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_trials_abstracts_en_3.5.3_3.0_1656475829985.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical_trials_abstracts", "en", "clinical/models")\
.setInputCols("token", "sentence")\
.setOutputCol("ner")\
.setCaseSensitive(True)
ner_converter = NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
tokenClassifier,
ner_converter])
text = ["This open-label, parallel-group, two-arm, pilot study compared the beta-cell protective effect of adding insulin glargine (GLA) vs. NPH insulin to ongoing metformin. Overall, 28 insulin-naive type 2 diabetes subjects (mean +/- SD age, 61.5 +/- 6.7 years; BMI, 30.7 +/- 4.3 kg/m(2)) treated with metformin and sulfonylurea were randomized to add once-daily GLA or NPH at bedtime."]
data = spark.createDataFrame([text]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical_trials_abstracts", "en", "clinical/models")
.setInputCols(Array("token", "sentence"))
.setOutputCol("ner")
.setCaseSensitive(true)
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))
val text = "This open-label, parallel-group, two-arm, pilot study compared the beta-cell protective effect of adding insulin glargine (GLA) vs. NPH insulin to ongoing metformin. Overall, 28 insulin-naive type 2 diabetes subjects (mean +/- SD age, 61.5 +/- 6.7 years; BMI, 30.7 +/- 4.3 kg/m(2)) treated with metformin and sulfonylurea were randomized to add once-daily GLA or NPH at bedtime."
val data = Seq(text).toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.clinical_trials_abstracts").predict("""This open-label, parallel-group, two-arm, pilot study compared the beta-cell protective effect of adding insulin glargine (GLA) vs. NPH insulin to ongoing metformin. Overall, 28 insulin-naive type 2 diabetes subjects (mean +/- SD age, 61.5 +/- 6.7 years; BMI, 30.7 +/- 4.3 kg/m(2)) treated with metformin and sulfonylurea were randomized to add once-daily GLA or NPH at bedtime.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlmr_large_qa_sv_sv_m3hrdadfi","sv") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlmr_large_qa_sv_sv_m3hrdadfi","sv")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("sv.answer_question.xlmr_roberta.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlmr_large_qa_sv_sv_m3hrdadfi|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|sv|
|Size:|1.9 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/m3hrdadfi/xlmr-large-qa-sv
---
layout: model
title: Turkish BertForQuestionAnswering model (from emre)
author: John Snow Labs
name: bert_qa_distilbert_tr_q_a
date: 2022-06-02
tags: [tr, open_source, question_answering, bert]
task: Question Answering
language: tr
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-tr-q-a` is a Turkish model originally trained by `emre`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_distilbert_tr_q_a_tr_4.0.0_3.0_1654187587161.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_distilbert_tr_q_a_tr_4.0.0_3.0_1654187587161.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_distilbert_tr_q_a","tr") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_distilbert_tr_q_a","tr")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("tr.answer_question.bert.distilled").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_distilbert_tr_q_a|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|tr|
|Size:|412.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/emre/distilbert-tr-q-a
- https://github.com/TQuad/turkish-nlp-qa-dataset
---
layout: model
title: Translate English to Tetun Dili Pipeline
author: John Snow Labs
name: translate_en_tdt
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, tdt, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `tdt`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_tdt_xx_2.7.0_2.4_1609687644374.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_tdt_xx_2.7.0_2.4_1609687644374.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_tdt", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_tdt", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.tdt').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_tdt|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Castilian, Spanish BertForQuestionAnswering model (from bhavikardeshna)
author: John Snow Labs
name: bert_qa_multilingual_bert_base_cased_spanish
date: 2022-06-02
tags: [open_source, question_answering, bert]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `multilingual-bert-base-cased-spanish` is a Castilian, Spanish model originally trained by `bhavikardeshna`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_spanish_es_4.0.0_3.0_1654188563403.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_spanish_es_4.0.0_3.0_1654188563403.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_multilingual_bert_base_cased_spanish","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_multilingual_bert_base_cased_spanish","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.bert.multilingual_spanish_tuned_base_cased.by_bhavikardeshna").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_multilingual_bert_base_cased_spanish|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/bhavikardeshna/multilingual-bert-base-cased-spanish
---
layout: model
title: Financial Relation Extraction on 10K filings (Small)
author: John Snow Labs
name: finre_financial_small
date: 2022-11-07
tags: [financial, 10k, filings, en, licensed]
task: Relation Extraction
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: RelationExtractionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts relations between amounts, counts, percentages, dates and the financial entities extracted with one of these models:
`finner_financial_small`
`finner_financial_medium`
`finner_financial_large`
We highly recommend using it with `finner_financial_large`.
## Predicted Entities
`has_amount`, `has_amount_date`, `has_percentage_date`, `has_percentage`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finre_financial_small_en_1.0.0_3.0_1667815219417.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finre_financial_small_en_1.0.0_3.0_1667815219417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencizer = nlp.SentenceDetectorDLModel\
.pretrained("sentence_detector_dl", "en") \
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")\
.setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€'])
bert_embeddings= nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
.setInputCols(["sentence", "token"])\
.setOutputCol("bert_embeddings")
ner_model = finance.NerModel.pretrained("finner_financial_large", "en", "finance/models")\
.setInputCols(["sentence", "token", "bert_embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
# ===========
# This is needed only to filter relation pairs using finance.RENerChunksFilter (see below)
# ===========
pos = nlp.PerceptronModel.pretrained("pos_anc", 'en')\
.setInputCols("sentence", "token")\
.setOutputCol("pos")
dependency_parser = nlp.DependencyParserModel.pretrained("dependency_conllu", "en") \
.setInputCols(["sentence", "pos", "token"]) \
.setOutputCol("dependencies")
ENTITIES = ['PROFIT', 'PROFIT_INCREASE', 'PROFIT_DECLINE', 'CF', 'CF_INCREASE', 'CF_DECREASE', 'LIABILITY', 'EXPENSE', 'EXPENSE_INCREASE', 'EXPENSE_DECREASE']
ENTITY_PAIRS = [f"{x}-AMOUNT" for x in ENTITIES]
ENTITY_PAIRS.extend([f"{x}-COUNT" for x in ENTITIES])
ENTITY_PAIRS.extend([f"{x}-PERCENTAGE" for x in ENTITIES])
ENTITY_PAIRS.append("AMOUNT-FISCAL_YEAR")
ENTITY_PAIRS.append("AMOUNT-DATE")
ENTITY_PAIRS.append("AMOUNT-CURRENCY")
re_ner_chunk_filter = finance.RENerChunksFilter() \
.setInputCols(["ner_chunk", "dependencies"])\
.setOutputCol("re_ner_chunk")\
.setRelationPairs(ENTITY_PAIRS)\
.setMaxSyntacticDistance(5)
# ===========
reDL = finance.RelationExtractionDLModel.pretrained('finre_financial_small', 'en', 'finance/models')\
.setInputCols(["re_ner_chunk", "sentence"])\
.setOutputCol("relations")
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentencizer,
tokenizer,
bert_embeddings,
ner_model,
ner_converter,
pos,
dependency_parser,
re_ner_chunk_filter,
reDL])
text = "In the third quarter of fiscal 2021, we received net proceeds of $342.7 million, after deducting underwriters discounts and commissions and offering costs of $31.8 million, including the exercise of the underwriters option to purchase additional shares. "
data = spark.createDataFrame([[text]]).toDF("text")
model = pipeline.fit(data)
results = model.transform(data)
```
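The `RENerChunksFilter` stage above keeps only candidate entity pairs whose NER labels appear in `ENTITY_PAIRS`. The filtering logic can be sketched in plain Python (toy chunks and a reduced allow-list, for illustration only):

```python
from itertools import combinations

# Toy NER chunks as (text, label); the allow-list mirrors a few of the
# ENTITY_PAIRS entries built in the pipeline above.
allowed = {"EXPENSE-AMOUNT", "CURRENCY-AMOUNT"}
chunks = [("net proceeds", "CF"), ("$", "CURRENCY"), ("342.7 million", "AMOUNT")]

# Keep a pair when either label order matches an allowed combination.
pairs = [(a[0], b[0]) for a, b in combinations(chunks, 2)
         if f"{a[1]}-{b[1]}" in allowed or f"{b[1]}-{a[1]}" in allowed]
print(pairs)  # [('$', '342.7 million')]
```

Only the surviving pairs are handed to the relation extraction model, which is why a tight allow-list keeps both runtime and false positives down.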
## Results
```bash
relation entity1 entity1_begin entity1_end chunk1 entity2 entity2_begin entity2_end chunk2 confidence
has_amount CF 49 60 net proceeds AMOUNT 66 78 342.7 million 0.9999101
has_amount CURRENCY 65 65 $ AMOUNT 66 78 342.7 million 0.9925425
has_amount EXPENSE 125 154 commissions and offering costs AMOUNT 160 171 31.8 million 0.9997677
has_amount CURRENCY 159 159 $ AMOUNT 160 171 31.8 million 0.998896
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finre_financial_small|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|405.7 MB|
## References
In-house annotations of 10K filings.
## Benchmarking
```bash
Relation Recall Precision F1 Support
has_amount 0.997 0.997 0.997 670
has_amount_date 0.996 0.994 0.995 470
has_percentage 1.000 1.000 1.000 87
has_percentage_date 0.985 1.000 0.993 68
other 1.000 1.000 1.000 205
Avg. 0.996 0.998 0.997 1500
Weighted-Avg. 0.997 0.997 0.997 1500
```
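The macro and weighted averages follow directly from the per-relation rows; a quick arithmetic check (numbers copied from the table above):

```python
# Per-relation (recall, precision, f1, support) rows copied from the table.
rows = {
    "has_amount":          (0.997, 0.997, 0.997, 670),
    "has_amount_date":     (0.996, 0.994, 0.995, 470),
    "has_percentage":      (1.000, 1.000, 1.000, 87),
    "has_percentage_date": (0.985, 1.000, 0.993, 68),
    "other":               (1.000, 1.000, 1.000, 205),
}
support = sum(s for _, _, _, s in rows.values())
macro_f1 = sum(f1 for _, _, f1, _ in rows.values()) / len(rows)
weighted_f1 = sum(f1 * s for _, _, f1, s in rows.values()) / support
print(round(macro_f1, 3), round(weighted_f1, 3))  # 0.997 0.997
```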
---
layout: model
title: Turkish Electra Embeddings (from dbmdz)
author: John Snow Labs
name: electra_embeddings_electra_base_turkish_mc4_uncased_generator
date: 2022-05-17
tags: [tr, open_source, electra, embeddings]
task: Embeddings
language: tr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-turkish-mc4-uncased-generator` is a Turkish model originally trained by `dbmdz`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_turkish_mc4_uncased_generator_tr_3.4.4_3.0_1652786631684.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_turkish_mc4_uncased_generator_tr_3.4.4_3.0_1652786631684.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_turkish_mc4_uncased_generator","tr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Spark NLP'yi seviyorum"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_turkish_mc4_uncased_generator","tr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Spark NLP'yi seviyorum").toDF("text")
val result = pipeline.fit(data).transform(data)
```
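Token vectors in the `embeddings` column are usually compared with cosine similarity. A minimal, framework-free sketch of that comparison (toy 3-dimensional vectors; real vectors from this model are much larger):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(round(cosine([0.2, 0.1, 0.7], [0.2, 0.1, 0.7]), 3))  # identical vectors -> 1.0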
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_embeddings_electra_base_turkish_mc4_uncased_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|tr|
|Size:|130.7 MB|
|Case sensitive:|false|
## References
- https://huggingface.co/dbmdz/electra-base-turkish-mc4-uncased-generator
- https://zenodo.org/badge/latestdoi/237817454
- https://twitter.com/mervenoyann
- https://github.com/allenai/allennlp/discussions/5265
- https://github.com/dbmdz
- http://www.andrew.cmu.edu/user/ko/
---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_nl6
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nl6` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl6_en_4.3.0_3.0_1675123966207.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl6_en_4.3.0_3.0_1675123966207.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_tiny_nl6","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_tiny_nl6","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
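T5 models are steered by a task prefix prepended to the input text (in Spark NLP, this is what `T5Transformer.setTask` configures). A small sketch of how the model input string is formed; the `summarize:` prefix below is the conventional T5 one, not something this particular checkpoint is guaranteed to be tuned for:

```python
def with_task(task, text):
    # T5 consumes "<task prefix> <input text>" as a single string.
    return f"{task.strip()} {text}"

prompt = with_task("summarize:", "Spark NLP is an open-source NLP library built on Apache Spark.")
print(prompt.split(" ", 1)[0])  # summarize:
```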
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_nl6|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|53.2 MB|
## References
- https://huggingface.co/google/t5-efficient-tiny-nl6
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Image De-Identification
author: John Snow Labs
name: ner_deid_large
date: 2023-01-03
tags: [en, licensed, ocr, image_deidentification]
task: Image DeIdentification
language: en
nav_key: models
edition: Visual NLP 4.0.0
spark_version: 3.2.1
supported: true
annotator: ImageDeIdentification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Deidentification NER (Large) is a Named Entity Recognition model that annotates text to find protected health information that may need to be deidentified. The entities it annotates are `Age`, `Contact`, `Date`, `Id`, `Location`, `Name`, and `Profession`. This model is trained with the `embeddings_clinical` word embeddings model, so be sure to use the same embeddings in the pipeline.
It protects specific health information that could identify living or deceased individuals. De-identification preserves patient confidentiality without removing the values and information that may be needed for research purposes.
## Predicted Entities
`Age`, `Contact`, `Date`, `Id`, `Location`, `Name`, `Profession`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/ocr/DEID_IMAGE/){:.button.button-orange.button-orange-trans.co.button-icon}
[Open in Colab](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/tutorials/Certification_Trainings/3.1.SparkOcrImageDeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_en_3.0.0_3.0_1617209688468.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_en_3.0.0_3.0_1617209688468.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
def deidentification_nlp_pipeline(input_column, prefix = ""):
    document_assembler = DocumentAssembler() \
        .setInputCol(input_column) \
        .setOutputCol(prefix + "document")
    # Sentence Detector annotator, processes various sentences per line
    sentence_detector = SentenceDetector() \
        .setInputCols([prefix + "document"]) \
        .setOutputCol(prefix + "sentence")
    tokenizer = Tokenizer() \
        .setInputCols([prefix + "sentence"]) \
        .setOutputCol(prefix + "token")
    # Clinical word embeddings
    word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
        .setInputCols([prefix + "sentence", prefix + "token"]) \
        .setOutputCol(prefix + "embeddings")
    # NER model trained on i2b2 (sampled from MIMIC) dataset
    clinical_ner = MedicalNerModel.pretrained("ner_deid_large", "en", "clinical/models") \
        .setInputCols([prefix + "sentence", prefix + "token", prefix + "embeddings"]) \
        .setOutputCol(prefix + "ner")
    custom_ner_converter = NerConverter() \
        .setInputCols([prefix + "sentence", prefix + "token", prefix + "ner"]) \
        .setOutputCol(prefix + "ner_chunk") \
        .setWhiteList(["NAME", "AGE", "CONTACT", "LOCATION", "PROFESSION", "PERSON", "DATE"])
    nlp_pipeline = Pipeline(stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        custom_ner_converter
    ])
    empty_data = spark.createDataFrame([[""]]).toDF(input_column)
    nlp_model = nlp_pipeline.fit(empty_data)
    return nlp_model
# Convert to images
binary_to_image = BinaryToImage() \
.setInputCol("content") \
.setOutputCol("image_raw")
# Extract text from image
ocr = ImageToText() \
.setInputCol("image_raw") \
.setOutputCol("text") \
.setIgnoreResolution(False) \
.setPageIteratorLevel(PageIteratorLevel.SYMBOL) \
.setPageSegMode(PageSegmentationMode.SPARSE_TEXT) \
.setConfidenceThreshold(70)
# Found coordinates of sensitive data
position_finder = PositionFinder() \
.setInputCols("ner_chunk") \
.setOutputCol("coordinates") \
.setPageMatrixCol("positions") \
.setMatchingWindow(1000) \
.setPadding(1)
# Draw filled rectangle for hide sensitive data
drawRegions = ImageDrawRegions() \
.setInputCol("image_raw") \
.setInputRegionsCol("coordinates") \
.setOutputCol("image_with_regions") \
.setFilledRect(True) \
.setRectColor(Color.gray)
# OCR pipeline
pipeline = Pipeline(stages=[
binary_to_image,
ocr,
deidentification_nlp_pipeline(input_column="text"),
position_finder,
drawRegions
])
image_path = pkg_resources.resource_filename("sparkocr", "resources/ocr/images/p1.jpg")
image_df = spark.read.format("binaryFile").load(image_path)
result = pipeline.fit(image_df).transform(image_df).cache()
```
```scala
def deidentificationNlpPipeline(inputColumn: String, prefix: String = ""): PipelineModel = {
  val documentAssembler = new DocumentAssembler()
    .setInputCol(inputColumn)
    .setOutputCol(prefix + "document")
  // Sentence Detector annotator, processes various sentences per line
  val sentenceDetector = new SentenceDetector()
    .setInputCols(Array(prefix + "document"))
    .setOutputCol(prefix + "sentence")
  val tokenizer = new Tokenizer()
    .setInputCols(Array(prefix + "sentence"))
    .setOutputCol(prefix + "token")
  // Clinical word embeddings
  val wordEmbeddings = WordEmbeddingsModel
    .pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array(prefix + "sentence", prefix + "token"))
    .setOutputCol(prefix + "embeddings")
  // NER model trained on i2b2 (sampled from MIMIC) dataset
  val clinicalNer = MedicalNerModel
    .pretrained("ner_deid_large", "en", "clinical/models")
    .setInputCols(Array(prefix + "sentence", prefix + "token", prefix + "embeddings"))
    .setOutputCol(prefix + "ner")
  val customNerConverter = new NerConverter()
    .setInputCols(Array(prefix + "sentence", prefix + "token", prefix + "ner"))
    .setOutputCol(prefix + "ner_chunk")
    .setWhiteList(Array("NAME", "AGE", "CONTACT", "LOCATION", "PROFESSION", "PERSON", "DATE"))
  val nlpPipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    wordEmbeddings,
    clinicalNer,
    customNerConverter
  ))
  val emptyData = Seq("").toDF(inputColumn)
  nlpPipeline.fit(emptyData)
}
// Convert to images
val binaryToImage = new BinaryToImage()
  .setInputCol("content")
  .setOutputCol("image_raw")
// Extract text from image
val ocr = new ImageToText()
  .setInputCol("image_raw")
  .setOutputCol("text")
  .setIgnoreResolution(false)
  .setPageIteratorLevel(PageIteratorLevel.SYMBOL)
  .setPageSegMode(PageSegmentationMode.SPARSE_TEXT)
  .setConfidenceThreshold(70)
// Find coordinates of sensitive data
val positionFinder = new PositionFinder()
  .setInputCols("ner_chunk")
  .setOutputCol("coordinates")
  .setPageMatrixCol("positions")
  .setMatchingWindow(1000)
  .setPadding(1)
// Draw a filled rectangle to hide sensitive data
val drawRegions = new ImageDrawRegions()
  .setInputCol("image_raw")
  .setInputRegionsCol("coordinates")
  .setOutputCol("image_with_regions")
  .setFilledRect(true)
  .setRectColor(Color.gray)
// OCR pipeline
val pipeline = new Pipeline().setStages(Array(
  binaryToImage,
  ocr,
  deidentificationNlpPipeline(inputColumn = "text"),
  positionFinder,
  drawRegions))
val imagePath = "resources/ocr/images/p1.jpg"
val imageDf = spark.read.format("binaryFile").load(imagePath)
val result = pipeline.fit(imageDf).transform(imageDf).cache()
```
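Conceptually, `PositionFinder` plus `ImageDrawRegions` maps each detected PHI chunk to pixel coordinates and paints a filled rectangle over it. The redaction step can be sketched on a toy character grid (illustration only; the real annotators operate on image matrices):

```python
def redact(page, regions, fill="X"):
    # Fill each (x, y, width, height) region of a character-grid "image".
    out = [list(row) for row in page]
    for x, y, w, h in regions:
        for r in range(y, min(y + h, len(out))):
            for c in range(x, min(x + w, len(out[r]))):
                out[r][c] = fill
    return ["".join(row) for row in out]

page = ["Name: John Smith", "DOB:  01/02/1970"]
print(redact(page, [(6, 0, 10, 1), (6, 1, 10, 1)]))
# ['Name: XXXXXXXXXX', 'DOB:  XXXXXXXXXX']
```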
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mwl") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mwl")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("mwl.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|mwl|
|Size:|113.5 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Multilingual XlmRoBertaForQuestionAnswering (from gokulkarthik)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_qa_chaii
date: 2022-06-23
tags: [en, hi, ta, open_source, question_answering, xlmroberta, xx]
task: Question Answering
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-qa-chaii` is a multilingual model originally trained by `gokulkarthik`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_qa_chaii_xx_4.0.0_3.0_1655996639001.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_qa_chaii_xx_4.0.0_3.0_1655996639001.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_qa_chaii","xx") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlm_roberta_qa_chaii","xx")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("xx.answer_question.chaii.xlm_roberta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
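Extractive QA models like this one do not generate text; they predict a begin/end span inside the context, and the answer is that slice. With Spark NLP's inclusive character offsets the extraction looks like this (offsets hand-picked for the example sentence):

```python
context = "My name is Clara and I live in Berkeley."
begin, end = 11, 15  # inclusive offsets, as in Spark NLP annotation begin/end
answer = context[begin:end + 1]
print(answer)  # Clara
```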
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_qa_chaii|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|xx|
|Size:|885.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/gokulkarthik/xlm-roberta-qa-chaii
---
layout: model
title: Korean BertForQuestionAnswering model (from eliza-dukim)
author: John Snow Labs
name: bert_qa_bert_base_multilingual_cased_korquad_v1
date: 2022-06-02
tags: [open_source, question_answering, bert]
task: Question Answering
language: ko
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased_korquad-v1` is a Korean model originally trained by `eliza-dukim`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_korquad_v1_ko_4.0.0_3.0_1654180276142.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_korquad_v1_ko_4.0.0_3.0_1654180276142.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_multilingual_cased_korquad_v1","ko") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_multilingual_cased_korquad_v1","ko")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("ko.answer_question.korquad.bert.multilingual_base_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_multilingual_cased_korquad_v1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|ko|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/eliza-dukim/bert-base-multilingual-cased_korquad-v1
---
layout: model
title: Extract relations between problem, treatment and test entities (ReDL)
author: John Snow Labs
name: redl_clinical_biobert
date: 2023-01-14
tags: [en, licensed, relation_extraction, clinical, tensorflow]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Extract relations such as `TrIP` (a certain treatment has improved a medical problem), and seven other relation types between problem, treatment, and test entities.
## Predicted Entities
`PIP`, `TeCP`, `TeRP`, `TrAP`, `TrCP`, `TrIP`, `TrNAP`, `TrWP`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_clinical_biobert_en_4.2.4_3.0_1673727174891.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_clinical_biobert_en_4.2.4_3.0_1673727174891.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
words_embedder = WordEmbeddingsModel() \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"]) \
.setOutputCol("embeddings")
ner_tagger = MedicalNerModel() \
.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens", "embeddings"]) \
.setOutputCol("ner_tags")
ner_converter = NerConverterInternal() \
.setInputCols(["sentences", "tokens", "ner_tags"]) \
.setOutputCol("ner_chunks")
dependency_parser = DependencyParserModel() \
.pretrained("dependency_conllu", "en") \
.setInputCols(["sentences", "pos_tags", "tokens"]) \
.setOutputCol("dependencies")
# Set a filter on pairs of named entities which will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter() \
.setInputCols(["ner_chunks", "dependencies"])\
.setMaxSyntacticDistance(10)\
.setOutputCol("re_ner_chunks")\
.setRelationPairs(["problem-test", "problem-treatment"])
# The dataset this model is trained on is annotated sentence-wise.
# The model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as the input.
re_model = RelationExtractionDLModel()\
.pretrained('redl_clinical_biobert', 'en', "clinical/models") \
.setPredictionThreshold(0.5)\
.setInputCols(["re_ner_chunks", "sentences"]) \
.setOutputCol("relations")
pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model])
text ="""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ),
one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . 
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely .
She had close follow-up with endocrinology post discharge .
"""
data = spark.createDataFrame([[text]]).toDF("text")
p_model = pipeline.fit(data)
result = p_model.transform(data)
```
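`setPredictionThreshold(0.5)` above discards relations the model is not confident about. The filtering step amounts to the following (toy labels with hypothetical confidences, for illustration):

```python
threshold = 0.5
preds = [("TrAP", 0.91), ("other", 0.42), ("TeRP", 0.73)]
# Keep only predictions at or above the confidence threshold.
kept = [label for label, conf in preds if conf >= threshold]
print(kept)  # ['TrAP', 'TeRP']
```

Raising the threshold trades recall for precision; 0.5 is simply the default starting point.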
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentences"))
.setOutputCol("tokens")
val pos_tagger = PerceptronModel
.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val words_embedder = WordEmbeddingsModel
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val ner_tagger = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel
.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
// Set a filter on pairs of named entities which will be treated as relation candidates
val re_ner_chunk_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunks", "dependencies"))
.setMaxSyntacticDistance(10)
.setOutputCol("re_ner_chunks")
.setRelationPairs(Array("problem-test", "problem-treatment"))
// The dataset this model is trained on is annotated sentence-wise.
// The model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as the input.
val re_model = RelationExtractionDLModel
.pretrained("redl_clinical_biobert", "en", "clinical/models")
.setPredictionThreshold(0.5)
.setInputCols(Array("re_ner_chunks", "sentences"))
.setOutputCol("relations")
val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model))
val data = Seq("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . 
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
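To make the candidate filtering above concrete: `RENerChunksFilter` keeps only entity pairs whose labels match one of the configured relation pairs and whose syntactic distance is within the limit. Below is a toy, plain-Python illustration of that idea, using token distance as a stand-in for the real dependency-tree distance (the chunk tuples and indices are made up for illustration):

```python
from itertools import combinations

def candidate_pairs(chunks, allowed_pairs, max_distance):
    """chunks: list of (text, entity_label, token_index) tuples."""
    out = []
    for a, b in combinations(chunks, 2):
        labels = {f"{a[1]}-{b[1]}".lower(), f"{b[1]}-{a[1]}".lower()}
        # keep the pair only if its label combination is allowed and it is close enough
        if labels & allowed_pairs and abs(a[2] - b[2]) <= max_distance:
            out.append((a[0], b[0]))
    return out

chunks = [("HTG-induced pancreatitis", "PROBLEM", 5),
          ("amoxicillin", "TREATMENT", 9),
          ("serum glucose", "TEST", 40)]
pairs = candidate_pairs(chunks, {"problem-test", "problem-treatment"}, max_distance=10)
# pairs -> [("HTG-induced pancreatitis", "amoxicillin")]
```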
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.clinical").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ),
one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . 
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely .
She had close follow-up with endocrinology post discharge .
""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("re_test_result_date_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("re_test_result_date_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.date_test_result.pipeline").predict("""He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%""")
```
## Results
```bash
| index | relations | entity1 | chunk1 | entity2 | chunk2 |
|-------|--------------|--------------|---------------------|--------------|---------|
| 0 | O | TEST | chest X-ray | MEASUREMENTS | 93% |
| 1 | O | TEST | CT scan | MEASUREMENTS | 93% |
| 2 | is_result_of | TEST | SpO2 | MEASUREMENTS | 93% |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|re_test_result_date_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- PerceptronModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- DependencyParserModel
- RelationExtractionModel
---
layout: model
title: Fast Neural Machine Translation Model from English to Azerbaijani
author: John Snow Labs
name: opus_mt_en_az
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, az, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `az`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_az_xx_2.7.0_2.4_1609166632809.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_az_xx_2.7.0_2.4_1609166632809.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_az", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("PUT YOUR STRING HERE")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_az", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.az').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_az|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Multilingual XLMRoBerta Embeddings (from hfl)
author: John Snow Labs
name: xlmroberta_embeddings_cino_small_v2
date: 2022-05-13
tags: [zh, ko, open_source, xlm_roberta, embeddings, xx, cino]
task: Embeddings
language: xx
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: XlmRoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `cino-small-v2` is a multilingual model originally trained by `hfl`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_cino_small_v2_xx_3.4.4_3.0_1652439686002.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_cino_small_v2_xx_3.4.4_3.0_1652439686002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_cino_small_v2","xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_cino_small_v2","xx")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_embeddings_cino_small_v2|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|xx|
|Size:|552.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/hfl/cino-small-v2
- https://github.com/ymcui/Chinese-Minority-PLM
- https://github.com/ymcui/MacBERT
- https://github.com/ymcui/Chinese-BERT-wwm
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/ymcui/HFL-Anthology
---
layout: model
title: Legal Pledge And Security Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_pledge_and_security_agreement_bert
date: 2022-11-25
tags: [en, legal, classification, agreement, pledge, licensed, bert]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_pledge_and_security_agreement_bert` model is a Bert Sentence Embeddings document classifier that predicts whether a document belongs to the class `pledge-and-security-agreement` or not (binary classification).
Unlike the Longformer-based variant, this model is lighter in terms of inference time.
## Predicted Entities
`pledge-and-security-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_pledge_and_security_agreement_bert_en_1.0.0_3.0_1669368647407.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_pledge_and_security_agreement_bert_en_1.0.0_3.0_1669368647407.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
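No usage snippet was published with this card; the following is a minimal sketch patterned on the other classifier cards in this collection. It assumes a licensed Legal NLP environment; the sentence-embeddings model name (`sent_bert_base_cased`), the `legal.ClassifierDLModel` loader, and the column names are assumptions, not confirmed details of this model's training setup.

```python
# Hypothetical usage sketch (requires a licensed Legal NLP environment);
# the embeddings model name below is an assumption.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_pledge_and_security_agreement_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

data = spark.createDataFrame([["PUT YOUR LEGAL DOCUMENT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```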
## Results
```bash
+-------------------------------+
|result                         |
+-------------------------------+
|[pledge-and-security-agreement]|
|[other]                        |
|[other]                        |
|[pledge-and-security-agreement]|
+-------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_pledge_and_security_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house + SEC documents
## Benchmarking
```bash
                        label  precision  recall  f1-score  support
                        other       0.95    0.96      0.96       82
pledge-and-security-agreement       0.91    0.89      0.90       35
                     accuracy          -       -      0.94      117
                    macro-avg       0.93    0.92      0.93      117
                 weighted-avg       0.94    0.94      0.94      117
```
---
layout: model
title: Part of Speech for Hebrew
author: John Snow Labs
name: pos_ud_htb
date: 2021-03-09
tags: [part_of_speech, open_source, hebrew, pos_ud_htb, he]
task: Part of Speech Tagging
language: he
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`.
## Predicted Entities
- None
- DET
- NOUN
- VERB
- CCONJ
- ADP
- PRON
- PUNCT
- ADJ
- ADV
- SCONJ
- NUM
- PROPN
- AUX
- X
- INTJ
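The `averaged perceptron` mentioned above is a perceptron whose final weights are the average of the weight vector over all training steps, which stabilizes the tagger. A toy sketch of the idea follows; it is not Spark NLP's implementation, and the suffix features are made up for illustration:

```python
from collections import defaultdict

def train_averaged_perceptron(examples, classes, epochs=5):
    """Toy averaged perceptron: examples are (feature_list, gold_class) pairs."""
    weights = defaultdict(float)   # (feature, class) -> current weight
    totals = defaultdict(float)    # running sum of weights, for averaging
    step = 0
    for _ in range(epochs):
        for feats, gold in examples:
            step += 1
            # predict the highest-scoring class under the current weights
            pred = max(classes, key=lambda c: sum(weights[(f, c)] for f in feats))
            if pred != gold:
                # standard perceptron update: reward gold, penalize prediction
                for f in feats:
                    weights[(f, gold)] += 1.0
                    weights[(f, pred)] -= 1.0
            # accumulate weights at every step so the final model is averaged
            for k, w in weights.items():
                totals[k] += w
    return {k: totals[k] / step for k in totals}

# tiny tagging-style example: suffix features predict the tag
examples = [(["suffix=ing"], "VERB"), (["suffix=ly"], "ADV"), (["suffix=ing"], "VERB")]
avg = train_averaged_perceptron(examples, ["VERB", "ADV"])
pred = max(["VERB", "ADV"], key=lambda c: avg.get(("suffix=ing", c), 0.0))
```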
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_htb_he_3.0.0_3.0_1615292289236.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_htb_he_3.0.0_3.0_1615292289236.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ud_htb", "he") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['שלום מ John Snow Labs! ']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_htb", "he")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("שלום מ John Snow Labs! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["שלום מ John Snow Labs! "]
token_df = nlu.load('he.pos.ud_htb').predict(text)
token_df
```
## Results
```bash
token pos
0 שלום None
1 מ ADP
2 John NOUN
3 Snow NOUN
4 Labs NOUN
5 ! PUNCT
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_htb|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|he|
---
layout: model
title: English T5ForConditionalGeneration Cased model (from gagan3012)
author: John Snow Labs
name: t5_k2t_new
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `k2t-new` is an English model originally trained by `gagan3012`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_k2t_new_en_4.3.0_3.0_1675103876567.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_k2t_new_en_4.3.0_3.0_1675103876567.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_k2t_new","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_k2t_new","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_k2t_new|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|262.7 MB|
## References
- https://huggingface.co/gagan3012/k2t-new
- https://user-images.githubusercontent.com/49101362/116334480-f5e57a00-a7dd-11eb-987c-186477f94b6e.png
- https://pypi.org/project/keytotext/
- https://pepy.tech/project/keytotext
- https://colab.research.google.com/github/gagan3012/keytotext/blob/master/Examples/K2T.ipynb
- https://share.streamlit.io/gagan3012/keytotext/UI/app.py
- https://github.com/gagan3012/keytotext/tree/master/Training%20Notebooks
- https://colab.research.google.com/github/gagan3012/keytotext/blob/master/Examples/K2T.ipynb
- https://github.com/gagan3012/keytotext/tree/master/Examples
- https://user-images.githubusercontent.com/49101362/116220679-90e64180-a755-11eb-9246-82d93d924a6c.png
- https://share.streamlit.io/gagan3012/keytotext/UI/app.py
- https://github.com/gagan3012/streamlit-tags
- https://user-images.githubusercontent.com/49101362/116162205-fc042980-a6fd-11eb-892e-8f6902f193f4.png
---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC5CDR_Chem_Modified_SciBERT_384
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC5CDR-Chem-Modified-SciBERT-384` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities
`Chemical`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_SciBERT_384_en_4.0.0_3.0_1657109428387.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_SciBERT_384_en_4.0.0_3.0_1657109428387.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_SciBERT_384","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_SciBERT_384","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_BC5CDR_Chem_Modified_SciBERT_384|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|410.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ghadeermobasher/BC5CDR-Chem-Modified-SciBERT-384
---
layout: model
title: Mapping Vaccine Products with Their Corresponding CVX Codes, Vaccine Names and CPT Codes
author: John Snow Labs
name: cvx_name_mapper
date: 2022-10-12
tags: [cvx, chunk_mapping, cpt, en, clinical, licensed]
task: Chunk Mapping
language: en
nav_key: models
edition: Healthcare NLP 4.2.1
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained model maps vaccine products to their corresponding CVX codes, vaccine names, and CPT codes. It returns three types of vaccine names: `short_name`, `full_name`, and `trade_name`.
## Predicted Entities
`cvx_code`, `short_name`, `full_name`, `trade_name`, `cpt_code`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/cvx_name_mapper_en_4.2.1_3.0_1665599269592.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/cvx_name_mapper_en_4.2.1_3.0_1665599269592.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('doc')
chunk_assembler = Doc2Chunk()\
.setInputCols(['doc'])\
.setOutputCol('ner_chunk')
chunkerMapper = ChunkMapperModel\
.pretrained("cvx_name_mapper", "en", "clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")\
.setRels(["cvx_code", "short_name", "full_name", "trade_name", "cpt_code"])
mapper_pipeline = Pipeline(stages=[
document_assembler,
chunk_assembler,
chunkerMapper
])
data = spark.createDataFrame([['DTaP'], ['MYCOBAX'], ['cholera, live attenuated']]).toDF('text')
res = mapper_pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("doc")
val chunk_assembler = new Doc2Chunk()
.setInputCols(Array("doc"))
.setOutputCol("ner_chunk")
val chunkerMapper = ChunkMapperModel.pretrained("cvx_name_mapper", "en","clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("mappings")
.setRels(Array("cvx_code", "short_name", "full_name", "trade_name", "cpt_code"))
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
chunk_assembler,
chunkerMapper))
val data = Seq("DTaP", "MYCOBAX", "cholera, live attenuated").toDS.toDF("text")
val result= pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.cvx_name").predict("""cholera, live attenuated""")
```
## Results
```bash
+--------------------------+--------+--------------------------+-------------------------------------------------------------+------------+--------+
|chunk |cvx_code|short_name |full_name |trade_name |cpt_code|
+--------------------------+--------+--------------------------+-------------------------------------------------------------+------------+--------+
|[DTaP] |[20] |[DTaP] |[diphtheria, tetanus toxoids and acellular pertussis vaccine]|[ACEL-IMUNE]|[90700] |
|[MYCOBAX] |[19] |[BCG] |[Bacillus Calmette-Guerin vaccine] |[MYCOBAX] |[90585] |
|[cholera, live attenuated]|[174] |[cholera, live attenuated]|[cholera, live attenuated] |[VAXCHORA] |[90625] |
+--------------------------+--------+--------------------------+-------------------------------------------------------------+------------+--------+
```
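Conceptually, a chunk mapper behaves like a lookup table keyed by the matched chunk text, returning only the relations selected with `setRels`. The following is a plain-Python illustration built from a subset of the columns in the table above, not the actual `ChunkMapperModel` implementation (which ships the mapping as a Spark annotator with additional matching options):

```python
# Illustrative lookup built from a subset of the Results columns above.
cvx_mappings = {
    "DTaP": {"cvx_code": "20", "short_name": "DTaP", "trade_name": "ACEL-IMUNE", "cpt_code": "90700"},
    "MYCOBAX": {"cvx_code": "19", "short_name": "BCG", "trade_name": "MYCOBAX", "cpt_code": "90585"},
    "cholera, live attenuated": {"cvx_code": "174", "short_name": "cholera, live attenuated",
                                 "trade_name": "VAXCHORA", "cpt_code": "90625"},
}

def map_chunk(chunk, rels=("cvx_code", "cpt_code")):
    """Return only the requested relations for a chunk, mimicking setRels()."""
    entry = cvx_mappings.get(chunk)
    return {rel: entry[rel] for rel in rels} if entry else None

result = map_chunk("MYCOBAX")
# result -> {"cvx_code": "19", "cpt_code": "90585"}
```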
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|cvx_name_mapper|
|Compatibility:|Healthcare NLP 4.2.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|25.1 KB|
---
layout: model
title: Indonesian RoBERTa Embeddings (Large)
author: John Snow Labs
name: roberta_embeddings_indonesian_roberta_large
date: 2022-04-14
tags: [roberta, embeddings, id, open_source]
task: Embeddings
language: id
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indonesian-roberta-large` is an Indonesian model originally trained by `flax-community`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indonesian_roberta_large_id_3.4.2_3.0_1649948688324.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indonesian_roberta_large_id_3.4.2_3.0_1649948688324.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indonesian_roberta_large","id") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Saya suka percikan NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indonesian_roberta_large","id")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Saya suka percikan NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("id.embed.indonesian_roberta_large").predict("""Saya suka percikan NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_indonesian_roberta_large|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|id|
|Size:|632.5 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/flax-community/indonesian-roberta-large
- https://arxiv.org/abs/1907.11692
- https://hf.co/w11wo
- https://hf.co/stevenlimcorn
- https://hf.co/munggok
- https://hf.co/chewkokwah
---
layout: model
title: English DistilBertForQuestionAnswering model (from andi611) Squad2 with Neg, Multi
author: John Snow Labs
name: distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-squad2-with-ner-with-neg-with-multi` is an English model originally trained by `andi611`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi_en_4.0.0_3.0_1654727406477.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi_en_4.0.0_3.0_1654727406477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_conll.distil_bert.base_uncased_with_neg_with_multi.by_andi611").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
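The nlu one-liner above packs the question and the context into a single string separated by `|||`. When building that input programmatically, a tiny helper (hypothetical, not part of nlu) keeps the convention in one place:

```python
def qa_input(question: str, context: str) -> str:
    """Join a (question, context) pair with the '|||' separator
    that nlu's question-answering loaders expect."""
    return f"{question}|||{context}"

print(qa_input("What is my name?", "My name is Clara and I live in Berkeley."))
# What is my name?|||My name is Clara and I live in Berkeley.
```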
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/andi611/distilbert-base-uncased-squad2-with-ner-with-neg-with-multi
---
layout: model
title: Pipeline to Detect Symptoms, Treatments and Other Entities in German
author: John Snow Labs
name: ner_healthcare_pipeline
date: 2023-03-15
tags: [ner, healthcare, licensed, de]
task: Named Entity Recognition
language: de
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_healthcare](https://nlp.johnsnowlabs.com/2021/09/15/ner_healthcare_de.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_pipeline_de_4.3.0_3.2_1678880382332.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_pipeline_de_4.3.0_3.2_1678880382332.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_healthcare_pipeline", "de", "clinical/models")
text = '''Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist Hernia femoralis, Akne, einseitig, ein hochmalignes bronchogenes Karzinom, das überwiegend im Zentrum der Lunge, in einem Hauptbronchus entsteht. Die mittlere Prävalenz wird auf 1/20.000 geschätzt.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_healthcare_pipeline", "de", "clinical/models")
val text = "Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist Hernia femoralis, Akne, einseitig, ein hochmalignes bronchogenes Karzinom, das überwiegend im Zentrum der Lunge, in einem Hauptbronchus entsteht. Die mittlere Prävalenz wird auf 1/20.000 geschätzt."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:------------------|--------:|------:|:----------------------|-------------:|
| 0 | Kleinzellige | 4 | 15 | MEASUREMENT | 0.6897 |
| 1 | Bronchialkarzinom | 17 | 33 | MEDICAL_CONDITION | 0.8983 |
| 2 | Kleinzelliger | 36 | 48 | MEDICAL_SPECIFICATION | 0.1777 |
| 3 | Lungenkrebs | 50 | 60 | MEDICAL_CONDITION | 0.9776 |
| 4 | SCLC | 63 | 66 | MEDICAL_CONDITION | 0.9626 |
| 5 | Hernia | 73 | 78 | MEDICAL_CONDITION | 0.8177 |
| 6 | femoralis | 80 | 88 | LOCAL_SPECIFICATION | 0.9119 |
| 7 | Akne | 91 | 94 | MEDICAL_CONDITION | 0.9995 |
| 8 | einseitig | 97 | 105 | MEASUREMENT | 0.909 |
| 9 | hochmalignes | 112 | 123 | MEDICAL_CONDITION | 0.6778 |
| 10 | bronchogenes | 125 | 136 | BODY_PART | 0.621 |
| 11 | Karzinom | 138 | 145 | MEDICAL_CONDITION | 0.8118 |
| 12 | Lunge | 179 | 183 | BODY_PART | 0.9985 |
| 13 | Hauptbronchus | 195 | 207 | BODY_PART | 0.9864 |
| 14 | mittlere | 223 | 230 | MEASUREMENT | 0.9651 |
| 15 | Prävalenz | 232 | 240 | MEDICAL_CONDITION | 0.9833 |
```
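The confidence column can be used to drop low-certainty chunks such as `Kleinzelliger` (0.1777) above. A minimal post-processing sketch (the 0.5 threshold is an assumption to tune per use case, and the helper is not part of the pipeline):

```python
def filter_by_confidence(rows, threshold=0.5):
    """Keep only NER result rows whose confidence meets the threshold.
    Each row is a dict like {"ner_chunk": ..., "ner_label": ..., "confidence": ...}."""
    return [r for r in rows if r["confidence"] >= threshold]

rows = [
    {"ner_chunk": "Akne", "ner_label": "MEDICAL_CONDITION", "confidence": 0.9995},
    {"ner_chunk": "Kleinzelliger", "ner_label": "MEDICAL_SPECIFICATION", "confidence": 0.1777},
]
print(filter_by_confidence(rows))  # only the "Akne" row survives
```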
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_healthcare_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|de|
|Size:|1.3 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Translate English to Hindi Pipeline
author: John Snow Labs
name: translate_en_hi
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, hi, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors helping with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this module is computationally expensive, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `hi`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_hi_xx_2.7.0_2.4_1609689464668.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_hi_xx_2.7.0_2.4_1609689464668.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_hi", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_hi", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.hi').predict(text, output_level='sentence')
translate_df
```
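Since translation cost grows with input length, one mitigation is to split a long document into sentences and translate them in short batches. A sketch of such a batching step (a hypothetical helper; the 512-character budget is an assumption to tune):

```python
def batch_sentences(sentences, max_chars=512):
    """Greedily group sentences into batches under a character budget,
    so each call to pipeline.annotate() stays short."""
    batches, current, size = [], [], 0
    for s in sentences:
        if current and size + len(s) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(s)
        size += len(s)
    if current:
        batches.append(current)
    return batches

print(batch_sentences(["aaaa", "bbbb", "cc"], max_chars=8))
# [['aaaa', 'bbbb'], ['cc']]
```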
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_hi|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Sentence Entity Resolver for CPT codes (Augmented)
author: John Snow Labs
name: sbiobertresolve_cpt_augmented
date: 2021-05-30
tags: [licensed, entity_resolution, en, clinical]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.4
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to CPT codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. This model is enriched with augmented data for better performance.
## Predicted Entities
CPT codes and their descriptions.
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_augmented_en_3.0.4_3.0_1622372290384.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_augmented_en_3.0.4_3.0_1622372290384.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
chunk2doc = Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_cpt_augmented","en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver])
data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val chunk2doc = new Chunk2Doc()
.setInputCols("ner_chunk")
.setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols("ner_chunk_doc")
.setOutputCol("sbert_embeddings")
val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_cpt_augmented","en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver))
val data = Seq("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret's Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.cpt.augmented").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""")
```
## Results
```bash
+--------------------+-----+---+-------+-----+----------+--------------------+--------------------+
| chunk|begin|end| entity| code|confidence| all_k_resolutions| all_k_codes|
+--------------------+-----+---+-------+-----+----------+--------------------+--------------------+
| hypertension| 68| 79|PROBLEM|36440| 0.3349|Hypertransfusion:...|36440:::24935:::0...|
|chronic renal ins...| 83|109|PROBLEM|50395| 0.0821|Nephrostomy:::Ren...|50395:::50328:::5...|
| COPD| 113|116|PROBLEM|32960| 0.1575|Lung collapse pro...|32960:::32215:::1...|
| gastritis| 120|128|PROBLEM|43501| 0.1772|Gastric ulcer sut...|43501:::43631:::4...|
| TIA| 136|138|PROBLEM|61460| 0.1432|Intracranial tran...|61460:::64742:::2...|
|a non-ST elevatio...| 182|202|PROBLEM|61624| 0.1151|Percutaneous non-...|61624:::61626:::3...|
|Guaiac positive s...| 208|229|PROBLEM|44005| 0.1115|Enterolysis:::Abd...|44005:::49080:::4...|
| mid LAD lesion| 332|345|PROBLEM|0281T| 0.2407|Plication of left...|0281T:::93462:::9...|
| hypotension| 362|372|PROBLEM|99135| 0.9935|Induced hypotensi...|99135:::99185:::9...|
| bradycardia| 378|388|PROBLEM|99135| 0.3884|Induced hypotensi...|99135:::33305:::3...|
| vagal reaction| 466|479|PROBLEM|55450| 0.1427|Vasoligation:::Va...|55450:::64408:::7...|
+--------------------+-----+---+-------+-----+----------+--------------------+--------------------+
```
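Each row packs its ranked candidates into the `:::`-separated `all_k_codes` and `all_k_resolutions` strings (truncated in the table above). A small helper (hypothetical, pure Python) pairs them back up:

```python
def top_k_candidates(all_k_codes: str, all_k_resolutions: str, k: int = 3):
    """Pair the ':::'-separated ranked codes with their descriptions
    and keep the top k candidates."""
    codes = all_k_codes.split(":::")
    descriptions = all_k_resolutions.split(":::")
    return list(zip(codes, descriptions))[:k]

# With placeholder values in the same format as the result columns:
print(top_k_candidates("A:::B:::C", "desc A:::desc B:::desc C", k=2))
# [('A', 'desc A'), ('B', 'desc B')]
```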
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_cpt_augmented|
|Compatibility:|Healthcare NLP 3.0.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[cpt_code_aug]|
|Language:|en|
|Case sensitive:|false|
---
layout: model
title: Social Determinants of Health (slim)
author: John Snow Labs
name: ner_sdoh_slim_wip
date: 2022-11-15
tags: [en, licensed, sdoh, social_determinants, public_health, clinical]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.2.1
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts terminology related to `Social Determinants of Health` from various kinds of biomedical documents.
## Predicted Entities
`Housing`, `Smoking`, `Substance_Frequency`, `Childhood_Development`, `Age`, `Other_Disease`, `Employment`, `Marital_Status`, `Diet`, `Disability`, `Mental_Health`, `Alcohol`, `Substance_Quantity`, `Family_Member`, `Race_Ethnicity`, `Gender`, `Geographic_Entity`, `Sexual_Orientation`, `Substance_Use`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_slim_wip_en_4.2.1_3.0_1668524622964.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_slim_wip_en_4.2.1_3.0_1668524622964.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_sdoh_slim_wip", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
clinical_embeddings,
ner_model,
ner_converter])
text = [""" Mother states that he does smoke, there is a family hx of alcohol on both maternal and paternal sides of the family, maternal grandfather who died of alcohol related complications and paternal grandmother with severe alcoholism. Pts own drinking began at age 16, living in LA, had a DUI at age 17 after totaling a new car that his mother bought for him, he was married. """]
data = spark.createDataFrame([text]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_sdoh_slim_wip", "en", "clinical/models")
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val nlpPipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
clinical_embeddings,
ner_model,
ner_converter))
val data = Seq("""Mother states that there is a family hx of alcohol on both maternal and paternal sides of the family, maternal grandfather who died of alcohol related complications and paternal grandmother with severe alcoholism. Pts own drinking began at age 16, had a DUI at age 17 after totaling a new car that his mother bought for him, he was married.""").toDS.toDF("text")
val result = nlpPipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.sdoh_slim_wip").predict(""" Mother states that he does smoke, there is a family hx of alcohol on both maternal and paternal sides of the family, maternal grandfather who died of alcohol related complications and paternal grandmother with severe alcoholism. Pts own drinking began at age 16, living in LA, had a DUI at age 17 after totaling a new car that his mother bought for him, he was married. """)
```
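Downstream, the extracted `(chunk, label)` pairs are often easier to review grouped by entity type, e.g. all `Alcohol` or `Family_Member` mentions together. A sketch of such a grouping step (a hypothetical helper, not part of the pipeline):

```python
from collections import defaultdict

def group_chunks(chunks):
    """Group (text, label) chunk pairs by entity label."""
    grouped = defaultdict(list)
    for text, label in chunks:
        grouped[label].append(text)
    return dict(grouped)

pairs = [("alcohol", "Alcohol"), ("grandfather", "Family_Member"), ("alcoholism", "Alcohol")]
print(group_chunks(pairs))
# {'Alcohol': ['alcohol', 'alcoholism'], 'Family_Member': ['grandfather']}
```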
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_rbtl3","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_rbtl3","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.embed.rbtl3").predict("""I love Spark NLP""")
```
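Token embeddings like these are typically compared with cosine similarity. A plain-Python sketch (assuming the vectors have already been extracted from the `embeddings` column as lists of floats):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors given as plain lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```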
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_rbtl3|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|228.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/hfl/rbtl3
- https://arxiv.org/abs/1906.08101
- https://github.com/google-research/bert
- https://github.com/ymcui/Chinese-BERT-wwm
- https://github.com/ymcui/MacBERT
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/ymcui/HFL-Anthology
- https://arxiv.org/abs/2004.13922
- https://arxiv.org/abs/1906.08101
---
layout: model
title: Part of Speech for Breton
author: John Snow Labs
name: pos_ud_keb
date: 2020-07-29 23:34:00 +0800
task: Part of Speech Tagging
language: br
edition: Spark NLP 2.5.5
spark_version: 2.4
tags: [pos, br]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_keb_br_2.5.5_2.4_1596053588899.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_keb_br_2.5.5_2.4_1596053588899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
pos = PerceptronModel.pretrained("pos_ud_keb", "br") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Distaolit dimp hon dleoù evel m'hor bo ivez distaolet d'hon dleourion.")
```
```scala
...
val pos = PerceptronModel.pretrained("pos_ud_keb", "br")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("""Distaolit dimp hon dleoù evel m'hor bo ivez distaolet d'hon dleourion.""").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Distaolit dimp hon dleoù evel m'hor bo ivez distaolet d'hon dleourion."""]
pos_df = nlu.load('br.pos').predict(text, output_level='token')
pos_df
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='pos', begin=0, end=8, result='VERB', metadata={'word': 'Distaolit'}),
Row(annotatorType='pos', begin=10, end=13, result='VERB', metadata={'word': 'dimp'}),
Row(annotatorType='pos', begin=15, end=17, result='DET', metadata={'word': 'hon'}),
Row(annotatorType='pos', begin=19, end=23, result='NOUN', metadata={'word': 'dleoù'}),
Row(annotatorType='pos', begin=25, end=28, result='ADP', metadata={'word': 'evel'}),
...]
```
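The annotation rows above are straightforward to summarize, for example by tallying how often each POS tag occurs. A sketch over annotation-like dicts (a hypothetical helper, not part of Spark NLP):

```python
from collections import Counter

def pos_counts(annotations):
    """Count POS labels over a list of annotation-like dicts
    shaped like the rows above ({"result": ..., "metadata": ...})."""
    return Counter(a["result"] for a in annotations)

rows = [
    {"result": "VERB", "metadata": {"word": "Distaolit"}},
    {"result": "VERB", "metadata": {"word": "dimp"}},
    {"result": "DET",  "metadata": {"word": "hon"}},
]
print(pos_counts(rows))  # Counter({'VERB': 2, 'DET': 1})
```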
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_keb|
|Type:|pos|
|Compatibility:|Spark NLP 2.5.5+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[pos]|
|Language:|br|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from arrandi)
author: John Snow Labs
name: xlmroberta_ner_arrandi_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `arrandi`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_arrandi_base_finetuned_panx_de_4.1.0_3.0_1660431029647.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_arrandi_base_finetuned_panx_de_4.1.0_3.0_1660431029647.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_arrandi_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_arrandi_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
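Inside the pipeline, `NerConverter` merges the token-level BIO tags (`B-PER`, `I-PER`, ...) emitted by the classifier into entity chunks. The merge logic can be sketched in plain Python (an illustration, not the actual implementation):

```python
def merge_bio(tokens, tags):
    """Merge BIO tags into (chunk, label) pairs.
    Stray I- tags without a matching open chunk are dropped in this sketch."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Angela", "Merkel", "besucht", "Berlin"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(merge_bio(tokens, tags))  # [('Angela Merkel', 'PER'), ('Berlin', 'LOC')]
```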
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_arrandi_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/arrandi/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: Voice of the Patients (embeddings_clinical_large)
author: John Snow Labs
name: ner_vop_wip_emb_clinical_large
date: 2023-05-19
tags: [licensed, clinical, en, ner, vop, patient]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts healthcare-related terms from documents written in patients' own words.
Note: the 'wip' suffix indicates that model development is work in progress; the model will be finalized and its performance improved in upcoming releases.
## Predicted Entities
`Gender`, `Employment`, `BodyPart`, `Age`, `PsychologicalCondition`, `Form`, `Vaccine`, `Drug`, `Substance`, `ClinicalDept`, `Laterality`, `DateTime`, `Test`, `VitalTest`, `Disease`, `Dosage`, `Route`, `Duration`, `Procedure`, `AdmissionDischarge`, `Symptom`, `Frequency`, `RelationshipStatus`, `HealthStatus`, `Allergen`, `Modifier`, `SubstanceQuantity`, `TestResult`, `MedicalDevice`, `Treatment`, `InjuryOrPoisoning`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_wip_emb_clinical_large_en_4.4.2_3.0_1684511324500.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_wip_emb_clinical_large_en_4.4.2_3.0_1684511324500.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_vop_wip_emb_clinical_large", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_vop_wip_emb_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| chunk | ner_label |
|:---------------------|:-----------------------|
| 20 year old | Age |
| girl | Gender |
| hyperthyroid | Disease |
| 1 month ago | DateTime |
| weak | Symptom |
| light | Symptom |
| panic attacks | PsychologicalCondition |
| depression | PsychologicalCondition |
| left | Laterality |
| chest | BodyPart |
| pain | Symptom |
| increased | TestResult |
| heart rate | VitalTest |
| rapidly | Modifier |
| weight loss | Symptom |
| 4 months | Duration |
| hospital | ClinicalDept |
| discharged | AdmissionDischarge |
| hospital | ClinicalDept |
| blood tests | Test |
| brain | BodyPart |
| mri | Test |
| ultrasound scan | Test |
| endoscopy | Procedure |
| doctors | Employment |
| homeopathy doctor | Employment |
| he | Gender |
| hyperthyroid | Disease |
| TSH | Test |
| 0.15 | TestResult |
| T3 | Test |
| T4 | Test |
| normal | TestResult |
| b12 deficiency | Disease |
| vitamin D deficiency | Disease |
| weekly | Frequency |
| supplement | Drug |
| vitamin D | Drug |
| 1000 mcg | Dosage |
| b12 | Drug |
| daily | Frequency |
| homeopathy medicine | Drug |
| 40 days | Duration |
| after 30 days | DateTime |
| TSH | Test |
| 0.5 | TestResult |
| now | DateTime |
| weakness | Symptom |
| depression | PsychologicalCondition |
| last week | DateTime |
| rapid | TestResult |
| heartrate | VitalTest |
| allopathy medicine | Treatment |
| homeopathy | Treatment |
| thyroid | BodyPart |
| allopathy | Treatment |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_wip_emb_clinical_large|
|Compatibility:|Healthcare NLP 4.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.9 MB|
|Dependencies:|embeddings_clinical_large|
## References
In-house annotated health-related text in colloquial language.
## Sample text from the training dataset
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Gender 1294 25 23 1317 0.98 0.98 0.98
Employment 1171 47 72 1243 0.96 0.94 0.95
BodyPart 2710 199 190 2900 0.93 0.93 0.93
Age 539 47 43 582 0.92 0.93 0.92
PsychologicalCondition 417 45 27 444 0.90 0.94 0.92
Form 250 33 16 266 0.88 0.94 0.91
Vaccine 37 2 5 42 0.95 0.88 0.91
Drug 1311 144 129 1440 0.90 0.91 0.91
Substance 399 69 22 421 0.85 0.95 0.90
ClinicalDept 288 25 38 326 0.92 0.88 0.90
Laterality 538 47 90 628 0.92 0.86 0.89
DateTime 3992 602 410 4402 0.87 0.91 0.89
Test 1064 141 144 1208 0.88 0.88 0.88
VitalTest 154 32 18 172 0.83 0.90 0.86
Disease 1755 316 260 2015 0.85 0.87 0.86
Dosage 347 62 65 412 0.85 0.84 0.85
Route 41 7 7 48 0.85 0.85 0.85
Duration 1845 233 465 2310 0.89 0.80 0.84
Procedure 555 83 150 705 0.87 0.79 0.83
AdmissionDischarge 25 1 9 34 0.96 0.74 0.83
Symptom 3710 727 865 4575 0.84 0.81 0.82
Frequency 851 159 228 1079 0.84 0.79 0.81
RelationshipStatus 18 3 6 24 0.86 0.75 0.80
HealthStatus 83 29 24 107 0.74 0.78 0.76
Allergen 29 1 17 46 0.97 0.63 0.76
Modifier 783 189 356 1139 0.81 0.69 0.74
SubstanceQuantity 60 17 25 85 0.78 0.71 0.74
TestResult 364 114 160 524 0.76 0.69 0.73
MedicalDevice 225 56 107 332 0.80 0.68 0.73
Treatment 142 34 86 228 0.81 0.62 0.70
InjuryOrPoisoning 104 24 72 176 0.81 0.59 0.68
macro_avg 25101 3513 4129 29230 0.87 0.82 0.84
micro_avg 25101 3513 4129 29230 0.88 0.86 0.87
```
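As a sanity check, the micro-averaged row above can be reproduced from the aggregate tp/fp/fn counts. A minimal sketch in plain Python (the counts are taken directly from the `micro_avg` row of the table):

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    # Standard precision/recall/F1 from true-positive, false-positive,
    # and false-negative counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Aggregate counts from the micro_avg row of the benchmarking table
p, r, f1 = prf(tp=25101, fp=3513, fn=4129)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.88 0.86 0.87
```

The same helper applied to any per-label row (e.g. Gender: tp=1294, fp=25, fn=23) recovers that row's precision, recall, and F1.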
---
layout: model
title: Emotion Detection Classifier
author: John Snow Labs
name: bert_sequence_classifier_emotion
date: 2022-01-14
tags: [bert_for_sequence, en, emotion, open_source]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 3.3.4
spark_version: 3.0
supported: true
annotator: BertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model was imported from `Hugging Face` and fine-tuned on the emotion [dataset](https://huggingface.co/nateraw/bert-base-uncased-emotion), leveraging `Bert` embeddings and `BertForSequenceClassification` for text classification purposes.
## Predicted Entities
`sadness`, `joy`, `love`, `anger`, `fear`, `surprise`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_emotion_en_3.3.4_3.0_1642152012549.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_emotion_en_3.3.4_3.0_1642152012549.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
sequenceClassifier = BertForSequenceClassification \
.pretrained('bert_sequence_classifier_emotion', 'en') \
.setInputCols(['token', 'document']) \
.setOutputCol('class')
pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])
example = spark.createDataFrame([["What do you mean? Are you kidding me?"]]).toDF("text")
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_emotion", "en")
.setInputCols(Array("document", "token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))
val example = Seq("What do you mean? Are you kidding me?").toDS.toDF("text")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.emotion.bert").predict("""What do you mean? Are you kidding me?""")
```
## Results
```bash
['anger']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_emotion|
|Compatibility:|Spark NLP 3.3.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|410.1 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## Data Source
[https://huggingface.co/datasets/viewer/?dataset=emotion](https://huggingface.co/datasets/viewer/?dataset=emotion)
## Benchmarking
NOTE: The author didn't share Precision / Recall / F1, only Validation Accuracy was shared as [Evaluation Results](https://huggingface.co/nateraw/bert-base-uncased-emotion#eval-results).
```bash
Validation Accuracy: 0.931
```
---
layout: model
title: Translate Mossi to English Pipeline
author: John Snow Labs
name: translate_mos_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, mos, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `mos`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_mos_en_xx_2.7.0_2.4_1609689738585.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_mos_en_xx_2.7.0_2.4_1609689738585.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_mos_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_mos_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.mos.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_mos_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Sentence Entity Resolver for ICD10-CM (general 3 character codes)
author: John Snow Labs
name: sbiobertresolve_icd10cm_generalised_augmented
date: 2023-05-24
tags: [licensed, en, clinical, entity_resolution]
task: Entity Resolution
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to ICD-10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It predicts ICD-10-CM codes up to 3 characters (in the ICD-10-CM code structure, the first three characters represent the general type of the injury or disease).
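The 3-character generalisation described above can be sketched in plain Python. This helper is not part of the Spark NLP API, only an illustration of how a full ICD-10-CM code collapses to its category:

```python
def generalise_icd10cm(code: str) -> str:
    # Illustrative only: an ICD-10-CM category is the first three
    # characters of the code, ignoring the optional dot.
    return code.replace(".", "")[:3]

print(generalise_icd10cm("E11.9"))    # E11 (type 2 diabetes mellitus)
print(generalise_icd10cm("I10"))      # I10 (essential hypertension)
print(generalise_icd10cm("S72.001A")) # S72 (fracture of femur)
```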
## Predicted Entities
`ICD-10-CM Codes`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_generalised_augmented_en_4.4.2_3.0_1684930238103.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_generalised_augmented_en_4.4.2_3.0_1684930238103.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
`sbiobertresolve_icd10cm_generalised_augmented` resolver model must be used with `sbiobert_base_cased_mli` as embeddings.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(['PROBLEM'])
chunk2doc = Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_generalised_augmented","en", "clinical/models")\
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
chunk2doc,
sbert_embedder,
icd10_resolver])
data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
.setWhiteList("PROBLEM")
val chunk2doc = new Chunk2Doc()
.setInputCols("ner_chunk")
.setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols("ner_chunk_doc")
.setOutputCol("sbert_embeddings")
val icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_generalised_augmented","en", "clinical/models")
.setInputCols("sbert_embeddings")
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
chunk2doc,
sbert_embedder,
icd10_resolver))
val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{% include programmingLanguageSelectScalaPython.html %}
```python
...
neoplasm_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_neoplasms_clinical","en","clinical/models")\
.setInputCols("token","chunk_embeddings")\
.setOutputCol("entity")
pipeline_puerile = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, neoplasm_resolver])
data = spark.createDataFrame([["""The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion."""]]).toDF("text")
model = pipeline_puerile.fit(data)
results = model.transform(data)
```
```scala
...
val neoplasm_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_neoplasms_clinical","en","clinical/models")
.setInputCols(Array("token","chunk_embeddings"))
.setOutputCol("resolution")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, neoplasm_resolver))
val data = Seq("The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
```bash
chunk entity icd10_neoplasm_description icd10_neoplasm_code
0 patient Organism Acute myelomonocytic leukemia, in remission C9251
1 infant Organism Malignant (primary) neoplasm, unspecified C801
2 nose Organ Malignant neoplasm of nasal cavity C300
3 She Organism Malignant neoplasm of thyroid gland C73
4 She Organism Malignant neoplasm of thyroid gland C73
5 She Organism Malignant neoplasm of thyroid gland C73
6 Aldex Gene_or_gene_product Acute megakaryoblastic leukemia not having ach... C9420
7 ear Organ Other benign neoplasm of skin of right ear and... D2321
8 She Organism Malignant neoplasm of thyroid gland C73
9 She Organism Malignant neoplasm of thyroid gland C73
10 She Organism Malignant neoplasm of thyroid gland C73
```
{:.model-param}
## Model Information
{:.table-model}
|----------------|-----------------------------------------|
| Name: | chunkresolve_icd10cm_neoplasms_clinical |
| Type: | ChunkEntityResolverModel |
| Compatibility: | Spark NLP 2.4.5+ |
| License: | Licensed |
| Edition: | Official |
| Input labels: | [token, chunk_embeddings] |
| Output labels: | [entity] |
| Language: | en |
| Case sensitive: | True |
| Dependencies: | embeddings_clinical |
{:.h2_title}
## Data Source
Trained on ICD10CM Dataset Ranges: C000-D489, R590-R599
[https://www.icd10data.com/ICD10CM/Codes/C00-D49](https://www.icd10data.com/ICD10CM/Codes/C00-D49)
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186 TFWav2Vec2ForCTC from Sarahliu186
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186` is an English model originally trained by Sarahliu186.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186_en_4.2.0_3.0_1664114424756.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186_en_4.2.0_3.0_1664114424756.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English image_classifier_vit_fancy_animales ViTForImageClassification from andy-0v0
author: John Snow Labs
name: image_classifier_vit_fancy_animales
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_fancy_animales` is an English model originally trained by andy-0v0.
## Predicted Entities
`penguin`, `chow chow`, `sloth`, `wombat`, `panda`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_fancy_animales_en_4.1.0_3.0_1660165986402.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_fancy_animales_en_4.1.0_3.0_1660165986402.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_fancy_animales", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_fancy_animales", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_fancy_animales|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Legal Authorizations Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_authorizations_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, authorizations, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Authorizations` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing a series of True/False values for each of the legal clause models you have added.
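The paragraph-splitting advice above can be sketched in plain Python. This is a hypothetical helper, not part of Spark NLP: it splits a document on blank lines and greedily packs paragraphs into chunks under a rough whitespace-token budget matching the 512-token embedding limit:

```python
def split_into_chunks(text: str, max_tokens: int = 512) -> list[str]:
    # Split on blank lines (paragraph splitting by multiline), then pack
    # consecutive paragraphs into chunks under a crude word-count budget.
    # A single paragraph longer than the budget still becomes one chunk
    # and would need further splitting.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())  # crude token estimate
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "Section 1. Authorizations clause...\n\n" + "word " * 600 + "\n\nSection 2. Other clause..."
chunks = split_into_chunks(doc)
print(len(chunks))  # 3
```

Each returned chunk can then be fed to the classifier as a separate document.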
## Predicted Entities
`Authorizations`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_authorizations_bert_en_1.0.0_3.0_1678050707023.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_authorizations_bert_en_1.0.0_3.0_1678050707023.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+----------------+
|result          |
+----------------+
|[Authorizations]|
|[Other]         |
|[Other]         |
|[Authorizations]|
+----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_authorizations_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Authorizations 0.93 0.96 0.95 73
Other 0.97 0.95 0.96 97
accuracy - - 0.95 170
macro-avg 0.95 0.95 0.95 170
weighted-avg 0.95 0.95 0.95 170
```
---
layout: model
title: Extract relations between problem, treatment and test entities (ReDL)
author: John Snow Labs
name: redl_clinical_biobert
date: 2021-07-24
tags: [en, licensed, relation_extraction, clinical]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 3.0.3
spark_version: 2.4
supported: true
annotator: RelationExtractionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Extract relations such as `TrIP` (a certain treatment has improved a medical problem) and 7 other relations between problem, treatment, and test entities.
## Predicted Entities
`PIP`, `TeCP`, `TeRP`, `TrAP`, `TrCP`, `TrIP`, `TrNAP`, `TrWP`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_clinical_biobert_en_3.0.3_2.4_1627118222780.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_clinical_biobert_en_3.0.3_2.4_1627118222780.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = sparknlp.annotators.Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
words_embedder = WordEmbeddingsModel() \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"]) \
.setOutputCol("embeddings")
ner_tagger = MedicalNerModel() \
.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens", "embeddings"]) \
.setOutputCol("ner_tags")
ner_converter = NerConverter() \
.setInputCols(["sentences", "tokens", "ner_tags"]) \
.setOutputCol("ner_chunks")
dependency_parser = DependencyParserModel() \
.pretrained("dependency_conllu", "en") \
.setInputCols(["sentences", "pos_tags", "tokens"]) \
.setOutputCol("dependencies")
# Set a filter on pairs of named entities which will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter() \
.setInputCols(["ner_chunks", "dependencies"])\
.setMaxSyntacticDistance(10)\
.setOutputCol("re_ner_chunks")\
.setRelationPairs(["problem-test", "problem-treatment"])
# The dataset this model is trained on is annotated at the sentence level.
# The model can also be trained on document-level relations, in which case use "document" instead of "sentences" as input at prediction time.
re_model = RelationExtractionDLModel()\
.pretrained('redl_clinical_biobert', 'en', "clinical/models") \
.setPredictionThreshold(0.5)\
.setInputCols(["re_ner_chunks", "sentences"]) \
.setOutputCol("relations")
pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model])
text ="""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ),
one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . 
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely .
She had close follow-up with endocrinology post discharge .
"""
data = spark.createDataFrame([[text]]).toDF("text")
p_model = pipeline.fit(data)
result = p_model.transform(data)
```
```scala
...
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentences"))
.setOutputCol("tokens")
val pos_tagger = PerceptronModel()
.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val words_embedder = WordEmbeddingsModel()
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val ner_tagger = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_converter = new NerConverter()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel()
.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
// Set a filter on pairs of named entities which will be treated as relation candidates
val re_ner_chunk_filter = RENerChunksFilter()
.setInputCols(Array("ner_chunks", "dependencies"))
.setMaxSyntacticDistance(10)
.setOutputCol("re_ner_chunks")
.setRelationPairs(Array("problem-test", "problem-treatment"))
// The dataset this model is trained on is annotated at the sentence level.
// The model can also be trained on document-level relations, in which case use "document" instead of "sentences" as input at prediction time.
val re_model = RelationExtractionDLModel()
.pretrained("redl_clinical_biobert", "en", "clinical/models")
.setPredictionThreshold(0.5)
.setInputCols(Array("re_ner_chunks", "sentences"))
.setOutputCol("relations")
val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model))
val data = Seq("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . 
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.clinical").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ),
one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . 
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely .
She had close follow-up with endocrinology post discharge .
""")
```
## Results
```bash
+--------+---------+-------------+-----------+--------------------+---------+-------------+-----------+--------------------+----------+
|relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence|
+--------+---------+-------------+-----------+--------------------+---------+-------------+-----------+--------------------+----------+
| TrAP|TREATMENT| 512| 522| amoxicillin| PROBLEM| 528| 556|a respiratory tra...|0.99796957|
| TrAP|TREATMENT| 571| 579| metformin| PROBLEM| 617| 620| T2DM|0.99757993|
| TrAP|TREATMENT| 599| 611| dapagliflozin| PROBLEM| 659| 661| HTG| 0.996036|
| TrAP| PROBLEM| 617| 620| T2DM|TREATMENT| 626| 637| atorvastatin| 0.9693424|
| TrAP| PROBLEM| 617| 620| T2DM|TREATMENT| 643| 653| gemfibrozil|0.99460286|
| TeRP| TEST| 739| 758|Physical examination| PROBLEM| 796| 810| dry oral mucosa|0.99775106|
| TeRP| TEST| 830| 854|her abdominal exa...| PROBLEM| 875| 884| tenderness|0.99272937|
| TeRP| TEST| 830| 854|her abdominal exa...| PROBLEM| 888| 895| guarding| 0.9840321|
| TeRP| TEST| 830| 854|her abdominal exa...| PROBLEM| 902| 909| rigidity| 0.9883966|
| TeRP| TEST| 1246| 1258| blood samples| PROBLEM| 1265| 1274| hemolyzing| 0.9534202|
| TeRP| TEST| 1507| 1517| her glucose| PROBLEM| 1553| 1566| still elevated| 0.9464761|
| TeRP| PROBLEM| 1553| 1566| still elevated| TEST| 1576| 1592| serum bicarbonate| 0.9428323|
| TeRP| PROBLEM| 1553| 1566| still elevated| TEST| 1656| 1661| lipase| 0.9558198|
| TeRP| PROBLEM| 1553| 1566| still elevated| TEST| 1670| 1672| U/L| 0.9214444|
| TeRP| TEST| 1676| 1702|The β-hydroxybuty...| PROBLEM| 1733| 1740| elevated| 0.9863963|
| TrAP|TREATMENT| 1937| 1951| an insulin drip| PROBLEM| 1957| 1961| euDKA| 0.9852455|
| O| PROBLEM| 1957| 1961| euDKA| TEST| 1991| 2003| the anion gap|0.94141793|
| O| PROBLEM| 1957| 1961| euDKA| TEST| 2015| 2027| triglycerides| 0.9622529|
+--------+---------+-------------+-----------+--------------------+---------+-------------+-----------+--------------------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|redl_clinical_biobert|
|Compatibility:|Healthcare NLP 3.0.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Case sensitive:|true|
## Data Source
Trained on the 2010 i2b2 relation challenge dataset.
## Benchmarking
```bash
Relation Recall Precision F1 Support
PIP 0.859 0.878 0.869 1435
TeCP 0.629 0.782 0.697 337
TeRP 0.903 0.929 0.916 2034
TrAP 0.872 0.866 0.869 1693
TrCP 0.641 0.677 0.659 340
TrIP 0.517 0.796 0.627 151
TrNAP 0.402 0.672 0.503 112
TrWP 0.257 0.824 0.392 109
Avg. 0.635 0.803 0.691 -
```
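The `Avg.` row in the benchmark table above is an unweighted (macro) average over the eight relation classes. As a quick sanity check, it can be recomputed from the per-class rows (figures copied from the table):

```python
# Per-class (recall, precision, f1) figures copied from the benchmark table.
per_class = {
    "PIP":   (0.859, 0.878, 0.869),
    "TeCP":  (0.629, 0.782, 0.697),
    "TeRP":  (0.903, 0.929, 0.916),
    "TrAP":  (0.872, 0.866, 0.869),
    "TrCP":  (0.641, 0.677, 0.659),
    "TrIP":  (0.517, 0.796, 0.627),
    "TrNAP": (0.402, 0.672, 0.503),
    "TrWP":  (0.257, 0.824, 0.392),
}

# Macro average: unweighted mean over classes, ignoring support.
n = len(per_class)
macro_recall = sum(r for r, _, _ in per_class.values()) / n
macro_precision = sum(p for _, p, _ in per_class.values()) / n
macro_f1 = sum(f for _, _, f in per_class.values()) / n
```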
---
layout: model
title: Legal NER for NDA (Definition of Confidential Information Clauses)
author: John Snow Labs
name: legner_nda_def_conf_info
date: 2023-04-10
tags: [en, licensed, legal, ner, nda, definition]
task: Named Entity Recognition
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is an NER model, intended to be run **only** after detecting the `DEF_OF_CONF_INFO` clause with a proper classifier (use the `legmulticlf_mnda_sections_paragraph_other` model for that purpose). It will extract the following entities: `CONF_INFO_FORM` and `CONF_INFO_TYPE`.
## Predicted Entities
`CONF_INFO_FORM`, `CONF_INFO_TYPE`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_def_conf_info_en_1.0.0_3.0_1681152951608.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_def_conf_info_en_1.0.0_3.0_1681152951608.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)\
.setCaseSensitive(True)
ner_model = legal.NerModel.pretrained("legner_nda_def_conf_info", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter
])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = [""""Confidential Information" shall mean all written or oral information of a proprietary, intellectual, or similar nature relating to GT Solar's business, projects, operations, activities, or affairs whether of a technical or financial nature or otherwise (including, without limitation, reports, financial information, business plans and proposals, ideas, concepts, trade secrets, know-how, processes, and other technical or business information, whether concerning GT Solar' businesses or otherwise) which has not been publicly disclosed and which the Recipient acquires directly or indirectly from GT Solar, its officers, employees, affiliates, agents or representatives."""]
result = model.transform(spark.createDataFrame([text]).toDF("text"))
```
## Results
```bash
+-------------+--------------+
|chunk |ner_label |
+-------------+--------------+
|written |CONF_INFO_FORM|
|oral |CONF_INFO_FORM|
|reports |CONF_INFO_TYPE|
|information |CONF_INFO_TYPE|
|plans |CONF_INFO_TYPE|
|proposals |CONF_INFO_TYPE|
|ideas |CONF_INFO_TYPE|
|concepts |CONF_INFO_TYPE|
|trade secrets|CONF_INFO_TYPE|
|know-how |CONF_INFO_TYPE|
|processes |CONF_INFO_TYPE|
|information |CONF_INFO_TYPE|
+-------------+--------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_nda_def_conf_info|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|16.3 MB|
## References
In-house annotations on non-disclosure agreements.
## Benchmarking
```bash
label precision recall f1-score support
CONF_INFO_FORM 1.00 0.95 0.97 20
CONF_INFO_TYPE 0.87 0.93 0.90 163
micro-avg 0.88 0.93 0.90 183
macro-avg 0.93 0.94 0.94 183
weighted-avg 0.88 0.93 0.90 183
```
---
layout: model
title: Translate Basque (family) to English Pipeline
author: John Snow Labs
name: translate_euq_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, euq, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `euq`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_euq_en_xx_2.7.0_2.4_1609687608095.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_euq_en_xx_2.7.0_2.4_1609687608095.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_euq_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_euq_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.euq.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_euq_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Extract Demographic Entities from Oncology Texts
author: John Snow Labs
name: ner_oncology_demographics
date: 2022-11-24
tags: [licensed, clinical, en, ner, oncology, demographics]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts demographic information from oncology texts, including age, gender, race/ethnicity, and smoking status.
Definitions of Predicted Entities:
- `Age`: All mentions of age, past or present, related to the patient or to anybody else.
- `Gender`: Gender-specific nouns and pronouns (including words such as "him" or "she", and family members such as "father").
- `Race_Ethnicity`: The race and ethnicity categories include racial and national origin or sociocultural groups.
- `Smoking_Status`: All mentions of smoking related to the patient or to someone else.
## Predicted Entities
`Age`, `Gender`, `Race_Ethnicity`, `Smoking_Status`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_demographics_en_4.2.2_3.0_1669300163954.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_demographics_en_4.2.2_3.0_1669300163954.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_demographics", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["The patient is a 40-year-old man with history of heavy smoking."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_demographics", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("The patient is a 40-year-old man with history of heavy smoking.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_demographics").predict("""The patient is a 40-year-old man with history of heavy smoking.""")
```
## Results
```bash
| chunk | ner_label |
|:------------|:---------------|
| 40-year-old | Age |
| man | Gender |
| smoking | Smoking_Status |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_demographics|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|34.6 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Smoking_Status 60 19 8 68 0.76 0.88 0.82
Age 934 33 15 949 0.97 0.98 0.97
Race_Ethnicity 57 5 5 62 0.92 0.92 0.92
Gender 1248 18 6 1254 0.99 1.00 0.99
macro_avg 2299 75 34 2333 0.91 0.95 0.93
micro_avg 2299 75 34 2333 0.97 0.99 0.98
```
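In the benchmark table above, precision, recall, and F1 follow directly from the tp/fp/fn counts. A standard recomputation for the `Gender` and `micro_avg` rows (counts copied from the table):

```python
def prf(tp: int, fp: int, fn: int):
    """Standard precision / recall / F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts copied from the benchmark table above.
gender = prf(tp=1248, fp=18, fn=6)   # table rounds these to 0.99, 1.00, 0.99
micro = prf(tp=2299, fp=75, fn=34)   # table rounds these to 0.97, 0.99, 0.98
```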
---
layout: model
title: Legal Non Disparagement Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_non_disparagement_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, non_disparagement, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset is aimed at contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Non_Disparagement` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them, unless you want to do binary classification at the sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in the Models Hub, producing as output a True/False value for each of the legal clause models you have added.
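The paragraph-splitting step mentioned above can be as simple as splitting on blank lines before feeding each provision to the classifier. A minimal sketch in plain Python, independent of the Spark NLP pipeline (the sample document text is illustrative only):

```python
def split_paragraphs(text):
    """Split a document into candidate provisions on blank lines (split by multiline)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# Illustrative two-provision document.
doc = (
    "1. Non-Disparagement. The parties agree not to disparage each other.\n\n"
    "2. Governing Law. This Agreement is governed by the laws of Delaware."
)
provisions = split_paragraphs(doc)
```

Each element of `provisions` can then be passed to the classifier as a separate row.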
## Predicted Entities
`Non_Disparagement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_non_disparagement_bert_en_1.0.0_3.0_1678049568901.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_non_disparagement_bert_en_1.0.0_3.0_1678049568901.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
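The usage snippet for this model is missing from the original card. The following is a hedged sketch based on the pattern used by sibling `legclf_*_bert` classifiers (a document assembler feeding BERT sentence embeddings into a document classifier whose input label is `sentence_embeddings` and output label is `class`); the embeddings model name `sent_bert_base_cased` is an assumption and may differ from the one this classifier was trained with.

```python
# Hedged sketch: the pipeline pattern is assumed from sibling legclf_*_bert model cards.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Assumed embeddings model; verify against the official card before use.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_non_disparagement_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

nlpPipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
```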
## Results
```bash
+-------------------+
|             result|
+-------------------+
|[Non_Disparagement]|
|            [Other]|
|            [Other]|
|[Non_Disparagement]|
+-------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_non_disparagement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|
## References
Training dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Non_Disparagement 0.98 0.98 0.98 41
Other 0.98 0.98 0.98 59
accuracy - - 0.98 100
macro-avg 0.98 0.98 0.98 100
weighted-avg 0.98 0.98 0.98 100
```
---
layout: model
title: Financial English BERT Embeddings (Base)
author: John Snow Labs
name: bert_embeddings_sec_bert_base
date: 2022-04-12
tags: [bert, embeddings, en, open_source, financial]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Financial Pretrained BERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `sec-bert-base` is an English model originally trained by `nlpaueb`. This is the reference base model, which means it uses the same architecture as BERT-BASE, trained on financial documents.
If you are interested in Financial Embeddings, take a look also at these two models:
- [sec-num](https://nlp.johnsnowlabs.com/2022/04/12/bert_embeddings_sec_bert_num_en_3_0.html): Same as this base model, but every number token is replaced with a `[NUM]` pseudo-token, handling all numeric expressions in a uniform manner and disallowing their fragmentation.
- [sec-shape](https://nlp.johnsnowlabs.com/2022/04/12/bert_embeddings_sec_bert_sh_en_3_0.html): Same as this base model but we replace numbers with pseudo-tokens that represent the number’s shape, so numeric expressions (of known shapes) are no longer fragmented, e.g., '53.2' becomes '[XX.X]' and '40,200.5' becomes '[XX,XXX.X]'.
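The `[NUM]` and shape pseudo-token preprocessing described above can be illustrated with a small sketch. This is an illustration of the idea only, not the exact `nlpaueb` preprocessing code (which, for shapes, additionally restricts replacements to a vocabulary of known shapes):

```python
import re

# Matches tokens made of digits with optional thousands separators / decimal points.
NUMERIC = re.compile(r"[\d.,]*\d[\d.,]*")

def to_num_token(token):
    """sec-num style: collapse any numeric token to a single [NUM] pseudo-token."""
    return "[NUM]" if NUMERIC.fullmatch(token) else token

def to_shape_token(token):
    """sec-shape style: replace digits with X, keeping punctuation, e.g. 53.2 -> [XX.X]."""
    if NUMERIC.fullmatch(token):
        return "[" + re.sub(r"\d", "X", token) + "]"
    return token
```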
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_sec_bert_base_en_3.4.2_3.0_1649759502537.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_sec_bert_base_en_3.4.2_3.0_1649759502537.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.sec_bert_base").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_sec_bert_base|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|409.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/nlpaueb/sec-bert-base
- https://arxiv.org/abs/2203.06482
- http://nlp.cs.aueb.gr/
---
layout: model
title: English BertForQuestionAnswering Cased model (from ericw0530)
author: John Snow Labs
name: bert_qa_ericw0530_finetuned_squad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `ericw0530`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_ericw0530_finetuned_squad_en_4.0.0_3.0_1657186496979.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_ericw0530_finetuned_squad_en_4.0.0_3.0_1657186496979.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ericw0530_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ericw0530_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_ericw0530_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ericw0530/bert-finetuned-squad
---
layout: model
title: English RobertaForQuestionAnswering Tiny Cased model (from deepset)
author: John Snow Labs
name: roberta_qa_tiny_squad2_step1
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tinyroberta-squad2-step1` is an English model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_tiny_squad2_step1_en_4.3.0_3.0_1674224441422.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_tiny_squad2_step1_en_4.3.0_3.0_1674224441422.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tiny_squad2_step1","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tiny_squad2_step1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_tiny_squad2_step1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|307.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/deepset/tinyroberta-squad2-step1
---
layout: model
title: Legal Limited Partnership Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_limited_partnership_agreement_bert
date: 2022-11-24
tags: [en, legal, classification, agreement, limited_partnership, licensed, bert]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_limited_partnership_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `limited-partnership-agreement` or not (Binary Classification).
Unlike the Longformer model, this model is lighter in terms of inference time.
## Predicted Entities
`limited-partnership-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_limited_partnership_agreement_bert_en_1.0.0_3.0_1669315953601.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_limited_partnership_agreement_bert_en_1.0.0_3.0_1669315953601.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
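This card omits the usage snippet; a minimal sketch following the common pattern for Legal NLP document classifiers over Bert sentence embeddings could look like the following (the `johnsnowlabs` import style and the `sent_bert_base_cased` embeddings model are assumptions, not confirmed by this card):

```python
# Hedged sketch: assumes a licensed Legal NLP installation and an active
# `spark` session; the sentence-embeddings model name is an assumption.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_limited_partnership_agreement_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR LEGAL DOCUMENT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```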
## Results
```bash
+-------+
|result|
+-------+
|[limited-partnership-agreement]|
|[other]|
|[other]|
|[limited-partnership-agreement]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_limited_partnership_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
limited-partnership-agreement 1.00 1.00 1.0 22
other 1.00 1.00 1.0 41
accuracy - - 1.0 63
macro-avg 1.00 1.00 1.0 63
weighted-avg 1.00 1.00 1.0 63
```
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from avioo1)
author: John Snow Labs
name: distilbert_qa_avioo1_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `avioo1`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_avioo1_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770116079.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_avioo1_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770116079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_avioo1_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_avioo1_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_avioo1_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/avioo1/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Recognize Entities DL pipeline for French - Large
author: John Snow Labs
name: entity_recognizer_lg
date: 2021-03-23
tags: [open_source, french, entity_recognizer_lg, pipeline, fr]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: fr
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The entity_recognizer_lg is a pretrained pipeline that processes text with basic processing steps and recognizes entities.
It performs most of the common text processing tasks on your DataFrame.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_fr_3.0.0_3.0_1616461515226.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_fr_3.0.0_3.0_1616461515226.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('entity_recognizer_lg', lang = 'fr')
annotations = pipeline.fullAnnotate("Bonjour de John Snow Labs!")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("entity_recognizer_lg", lang = "fr")
val result = pipeline.fullAnnotate("Bonjour de John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Bonjour de John Snow Labs!"]
result_df = nlu.load('fr.ner').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | embeddings | ner | entities |
|---:|:--------------------------------|:-------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
| 0 | ['Bonjour de John Snow Labs! '] | ['Bonjour de John Snow Labs!'] | ['Bonjour', 'de', 'John', 'Snow', 'Labs!'] | [[-0.010997000150382,.,...]] | ['O', 'O', 'I-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|entity_recognizer_lg|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|fr|
---
layout: model
title: Legal Captions Clause Binary Classifier
author: John Snow Labs
name: legclf_captions_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `captions` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
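As a toy illustration of the paragraph-splitting idea (plain Python, not the Legal NLP splitting annotators covered in the tutorial), a document can be pre-chunked to respect the 512-token limit like this:

```python
def split_paragraphs(text, max_tokens=512):
    """Split a document into paragraphs on blank lines, then cap each
    piece at max_tokens whitespace-separated tokens (a rough proxy for
    the embedding model's 512-token limit)."""
    pieces = []
    for para in text.split("\n\n"):
        tokens = para.split()
        if not tokens:
            continue
        for i in range(0, len(tokens), max_tokens):
            pieces.append(" ".join(tokens[i:i + max_tokens]))
    return pieces

doc = "FIRST CLAUSE. Some text here.\n\nSECOND CLAUSE. More text here."
print(split_paragraphs(doc, max_tokens=4))
```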
## Predicted Entities
`other`, `captions`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_captions_clause_en_1.0.0_3.2_1660123284259.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_captions_clause_en_1.0.0_3.2_1660123284259.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
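This card omits the usage snippet; a minimal sketch following the common pattern for Legal NLP clause classifiers over Bert sentence embeddings could look like the following (the `johnsnowlabs` import style and the `sent_bert_base_cased` embeddings model are assumptions, not confirmed by this card):

```python
# Hedged sketch: assumes a licensed Legal NLP installation and an active
# `spark` session; the sentence-embeddings model name is an assumption.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

clause_classifier = legal.ClassifierDLModel.pretrained("legclf_captions_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, clause_classifier])

df = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```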
## Results
```bash
+-------+
| result|
+-------+
|[captions]|
|[other]|
|[other]|
|[captions]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_captions_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
captions 0.96 1.00 0.98 50
other 1.00 0.98 0.99 105
accuracy - - 0.99 155
macro-avg 0.98 0.99 0.99 155
weighted-avg 0.99 0.99 0.99 155
```
---
layout: model
title: ICD10PCS Entity Resolver
author: John Snow Labs
name: chunkresolve_icd10pcs_clinical
class: ChunkEntityResolverModel
language: en
nav_key: models
repository: clinical/models
date: 2020-04-21
task: Entity Resolution
edition: Healthcare NLP 2.4.2
spark_version: 2.4
tags: [clinical,licensed,entity_resolution,en]
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Entity Resolution model based on KNN using Word Embeddings and Word Mover's Distance.
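As a toy illustration of the KNN lookup (plain Python with cosine distance for simplicity; the actual model uses Word Mover's Distance over clinical word embeddings, so this is only a sketch with hypothetical reference vectors):

```python
import math

def cosine_dist(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def knn_resolve(chunk_vec, code_vectors, k=1):
    """Return the k codes whose reference embeddings are closest
    to the chunk embedding."""
    ranked = sorted(code_vectors, key=lambda c: cosine_dist(chunk_vec, code_vectors[c]))
    return ranked[:k]

# Hypothetical 2-d embeddings for two ICD-10-PCS codes
codes = {"6A3Z1ZZ": [1.0, 0.0], "8E0ZXY4": [0.0, 1.0]}
print(knn_resolve([0.9, 0.1], codes))  # ['6A3Z1ZZ']
```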
## Predicted Entities
ICD10-PCS Codes and their normalized definition with `clinical_embeddings`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10pcs_clinical_en_2.4.5_2.4_1587491320087.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10pcs_clinical_en_2.4.5_2.4_1587491320087.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPython.html %}
```python
...
model = ChunkEntityResolverModel.pretrained("chunkresolve_icd10pcs_clinical","en","clinical/models") \
.setInputCols("token","chunk_embeddings") \
.setOutputCol("entity")
pipeline_icd10pcs = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, ner, chunk_embeddings, model])
data = ["""He has a starvation ketosis but nothing found for significant for dry oral mucosa"""]
pipeline_model = pipeline_icd10pcs.fit(spark.createDataFrame([[""]]).toDF("text"))
light_pipeline = LightPipeline(pipeline_model)
result = light_pipeline.annotate(data)
```
```scala
...
val model = ChunkEntityResolverModel.pretrained("chunkresolve_icd10pcs_clinical","en","clinical/models")
.setInputCols("token","chunk_embeddings")
.setOutputCol("entity")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, ner, chunk_embeddings, model))
val data = Seq("He has a starvation ketosis but nothing found for significant for dry oral mucosa").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
```bash
| | chunks | begin | end | code | resolutions |
|---|----------------------|-------|-----|---------|--------------------------------------------------|
| 0 | a starvation ketosis | 7 | 26 | 6A3Z1ZZ | Hyperthermia, Multiple:::Narcosynthesis:::Hype...|
| 1 | dry oral mucosa | 66 | 80 | 8E0ZXY4 | Yoga Therapy:::Release Cecum, Open Approach:::...|
```
{:.model-param}
## Model Information
{:.table-model}
|----------------|--------------------------------|
| Name: | chunkresolve_icd10pcs_clinical |
| Type: | ChunkEntityResolverModel |
| Compatibility: | Spark NLP 2.4.2+ |
| License: | Licensed |
| Edition: | Official |
|Input labels: | token, chunk_embeddings |
|Output labels: | entity |
| Language: | en |
| Case sensitive: | True |
| Dependencies: | embeddings_clinical |
{:.h2_title}
## Data Source
Trained on ICD10 Procedure Coding System dataset
https://www.icd10data.com/ICD10PCS/Codes
---
layout: model
title: Finnish asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot TFWav2Vec2ForCTC from aapot
author: John Snow Labs
name: pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot
date: 2022-09-24
tags: [wav2vec2, fi, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot` is a Finnish model originally trained by aapot.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot_fi_4.2.0_3.0_1664022498179.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot_fi_4.2.0_3.0_1664022498179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot', lang = 'fi')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot", lang = "fi")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|fi|
|Size:|3.6 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English image_classifier_vit_roomidentifier ViTForImageClassification from lazyturtl
author: John Snow Labs
name: image_classifier_vit_roomidentifier
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_roomidentifier` is an English model originally trained by lazyturtl.
## Predicted Entities
`Kitchen`, `Bedroom`, `Bathroom`, `DinningRoom`, `LivingRoom`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_roomidentifier_en_4.1.0_3.0_1660168339182.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_roomidentifier_en_4.1.0_3.0_1660168339182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_roomidentifier", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_roomidentifier", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_roomidentifier|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_kv256
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-kv256` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_kv256_en_4.3.0_3.0_1675121414682.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_kv256_en_4.3.0_3.0_1675121414682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_small_kv256","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_small_kv256","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_kv256|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|256.0 MB|
## References
- https://huggingface.co/google/t5-efficient-small-kv256
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Fast Neural Machine Translation Model from Ga to English
author: John Snow Labs
name: opus_mt_gaa_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, gaa, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `gaa`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_gaa_en_xx_2.7.0_2.4_1609163836574.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_gaa_en_xx_2.7.0_2.4_1609163836574.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_gaa_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_gaa_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.gaa.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_gaa_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Chinese Word Segmentation
author: John Snow Labs
name: wordseg_pku
date: 2021-01-03
task: Word Segmentation
language: zh
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, word_segmentation, cn, zh]
supported: true
annotator: WordSegmenterModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
WordSegmenterModel (WSM) is based on a maximum entropy probability model to detect word boundaries in Chinese text. Chinese text is written without white space between words, and a computer-based application cannot know _a priori_ which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.
References:
- Xue, Nianwen. "Chinese word segmentation as character tagging." International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing.
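To make the character-tagging formulation concrete, here is a toy sketch (plain Python, not the model's maximum-entropy implementation) that rebuilds words from per-character begin/inside tags:

```python
def merge_tagged_chars(chars, tags):
    """Rebuild words from per-character tags: 'B' begins a new word,
    'I' continues the current one (Xue's scheme, simplified to two tags)."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words

print(merge_tagged_chars("然而这样", ["B", "I", "B", "I"]))  # ['然而', '这样']
```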
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_pku_zh_2.7.0_2.4_1609694210774.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_pku_zh_2.7.0_2.4_1609694210774.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an nlp pipeline as a substitute of the Tokenizer stage.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
word_segmenter = WordSegmenterModel.pretrained('wordseg_pku', 'zh')\
.setInputCols("document")\
.setOutputCol("token")
pipeline = Pipeline(stages=[document_assembler, word_segmenter])
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
example = spark.createDataFrame([['然而,这样的处理也衍生了一些问题。']], ["text"])
result = model.transform(example)
```
```scala
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_pku", "zh")
.setInputCols("document")
.setOutputCol("token")
val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter))
val data = Seq("然而,这样的处理也衍生了一些问题。").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""然而,这样的处理也衍生了一些问题。"""]
ner_df = nlu.load('zh.segment_words.pku').predict(text, output_level='token')
ner_df
```
## Results
```bash
+----------------------------------+--------------------------------------------------------+
|text |result |
+----------------------------------+--------------------------------------------------------+
|然而,这样的处理也衍生了一些问题。|[然而, ,, 这样, 的, 处理, 也, 衍生, 了, 一些, 问题, 。]|
+----------------------------------+--------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|wordseg_pku|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[token]|
|Language:|zh|
## Data Source
The model was trained on the Peking University (PKU) dataset available from the Second International Chinese Word Segmentation Bakeoff [SIGHAN 2005](http://sighan.cs.uchicago.edu/bakeoff2005/).
## Benchmarking
```bash
| Model         | precision    | recall       | f1-score     |
|---------------|--------------|--------------|--------------|
| WORDSEG_CTB   | 0.6453       | 0.6341       | 0.6397       |
| WORDSEG_WEIBO | 0.5454       | 0.5655       | 0.5553       |
| WORDSEG_MSR   | 0.5984       | 0.6088       | 0.6035       |
| WORDSEG_PKU   | 0.6094       | 0.6321       | 0.6206       |
| WORDSEG_LARGE | 0.6326       | 0.6269       | 0.6297       |
```
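The f1-score column is the harmonic mean of precision and recall; as a quick sketch (tiny differences against the table come from the reported precision/recall already being rounded):

```python
def f1(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# WORDSEG_PKU row from the table above
print(round(f1(0.6094, 0.6321), 4))
```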
---
layout: model
title: Entity Recognizer LG
author: John Snow Labs
name: entity_recognizer_lg
date: 2022-06-25
tags: ["no", open_source]
task: Named Entity Recognition
language: "no"
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The entity_recognizer_lg is a pretrained pipeline that processes text with a set of basic steps and recognizes entities.
It performs most of the common text processing tasks on your dataframe.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_no_4.0.0_3.0_1656124813478.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_no_4.0.0_3.0_1656124813478.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("entity_recognizer_lg", "no")
result = pipeline.annotate("""I love johnsnowlabs! """)
```
{:.nlu-block}
```python
import nlu
nlu.load("no.ner.lg").predict("""I love johnsnowlabs! """)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|entity_recognizer_lg|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|no|
|Size:|2.5 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- NerDLModel
- NerConverter
---
layout: model
title: Word2Vec Embeddings in Somali (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, so, open_source]
task: Embeddings
language: so
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_so_3.4.1_3.0_1647458819071.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_so_3.4.1_3.0_1647458819071.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","so") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Waan jeclahay Spark Nlp"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","so")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Waan jeclahay Spark Nlp").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("so.embed.w2v_cc_300d").predict("""Waan jeclahay Spark Nlp""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|so|
|Size:|98.5 MB|
|Case sensitive:|false|
|Dimension:|300|
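The 300-dimensional vectors in the `embeddings` output column are typically compared with cosine similarity. A minimal sketch, with toy 3-d vectors standing in for the model's 300-d output:

```python
import math

def cosine(u, v):
    # cosine similarity: dot product over the product of the norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # ≈ 1.0 (same direction)
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)
```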
---
layout: model
title: Sentiment Analysis of German texts
author: John Snow Labs
name: classifierdl_bert_sentiment
date: 2021-09-09
tags: [de, sentiment, classification, open_source]
task: Sentiment Analysis
language: de
edition: Spark NLP 3.2.0
spark_version: 2.4
supported: true
annotator: ClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model identifies the sentiments (positive or negative) in German texts.
## Predicted Entities
`POSITIVE`, `NEGATIVE`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_DE/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_De_SENTIMENT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_bert_sentiment_de_3.2.0_2.4_1631184887201.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_bert_sentiment_de_3.2.0_2.4_1631184887201.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
embeddings = BertSentenceEmbeddings\
.pretrained('labse', 'xx') \
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_bert_sentiment", "de") \
.setInputCols(["document", "sentence_embeddings"]) \
.setOutputCol("class")
de_sentiment_pipeline = Pipeline(stages=[document, embeddings, sentimentClassifier])
light_pipeline = LightPipeline(de_sentiment_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
result1 = light_pipeline.annotate("Spiel und Meisterschaft nicht spannend genug? Muss man jetzt den Videoschiedsrichter kontrollieren? Ich bin entsetzt...dachte der darf nur bei krassen Fehlentscheidungen ran. So macht der Fussball keinen Spass mehr.")
result2 = light_pipeline.annotate("Habe gestern am Mittwoch den #werder Podcast vermisst. Wie schnell man sich an etwas gewöhnt und darauf freut. Danke an @Plainsman74 für die guten Interviews und den Einblick hinter die Kulissen von @werderbremen. Angenehme Winterpause weiterhin!")
print(result1["class"], result2["class"], sep = "\n")
```
```scala
val document = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val embeddings = BertSentenceEmbeddings
.pretrained("labse", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence_embeddings")
val sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_bert_sentiment", "de")
.setInputCols(Array("document", "sentence_embeddings"))
.setOutputCol("class")
val de_sentiment_pipeline = new Pipeline().setStages(Array(document, embeddings, sentimentClassifier))
val light_pipeline = new LightPipeline(de_sentiment_pipeline.fit(Seq("").toDF("text")))
val result1 = light_pipeline.annotate("Spiel und Meisterschaft nicht spannend genug? Muss man jetzt den Videoschiedsrichter kontrollieren? Ich bin entsetzt...dachte der darf nur bei krassen Fehlentscheidungen ran. So macht der Fussball keinen Spass mehr.")
val result2 = light_pipeline.annotate("Habe gestern am Mittwoch den #werder Podcast vermisst. Wie schnell man sich an etwas gewöhnt und darauf freut. Danke an @Plainsman74 für die guten Interviews und den Einblick hinter die Kulissen von @werderbremen. Angenehme Winterpause weiterhin!")
```
{:.nlu-block}
```python
import nlu
nlu.load("de.classify.sentiment.bert").predict("""Habe gestern am Mittwoch den #werder Podcast vermisst. Wie schnell man sich an etwas gewöhnt und darauf freut. Danke an @Plainsman74 für die guten Interviews und den Einblick hinter die Kulissen von @werderbremen. Angenehme Winterpause weiterhin!""")
```
## Results
```bash
['NEGATIVE']
['POSITIVE']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|classifierdl_bert_sentiment|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|de|
## Data Source
https://github.com/charlesmalafosse/open-dataset-for-sentiment-analysis/
## Benchmarking
```bash
label precision recall f1-score support
NEGATIVE 0.83 0.85 0.84 978
POSITIVE 0.94 0.93 0.94 2582
accuracy - - 0.91 3560
macro-avg 0.89 0.89 0.89 3560
weighted-avg 0.91 0.91 0.91 3560
```
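The averaging rows can be reproduced from the per-class scores: macro-avg is the unweighted mean across classes, while weighted-avg weights each class by its support.

```python
# per-class f1 and support from the table above
f1 = {"NEGATIVE": 0.84, "POSITIVE": 0.94}
support = {"NEGATIVE": 978, "POSITIVE": 2582}

macro = sum(f1.values()) / len(f1)
weighted = sum(f1[c] * support[c] for c in f1) / sum(support.values())
print(round(macro, 2), round(weighted, 2))  # 0.89 0.91
```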
---
layout: model
title: Moldavian, Moldovan, Romanian asr_wav2vec2_large_xlsr_53_romanian_by_anton_l TFWav2Vec2ForCTC from anton-l
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_53_romanian_by_anton_l
date: 2022-09-25
tags: [wav2vec2, ro, audio, open_source, asr]
task: Automatic Speech Recognition
language: ro
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_romanian_by_anton_l` is a Romanian (Moldavian/Moldovan) model originally trained by anton-l.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_wav2vec2_large_xlsr_53_romanian_by_anton_l_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_romanian_by_anton_l_ro_4.2.0_3.0_1664098684581.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_romanian_by_anton_l_ro_4.2.0_3.0_1664098684581.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xlsr_53_romanian_by_anton_l", "ro")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xlsr_53_romanian_by_anton_l", "ro")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xlsr_53_romanian_by_anton_l|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|ro|
|Size:|1.2 GB|
---
layout: model
title: Pipeline to Mapping MESH Codes with Their Corresponding UMLS Codes
author: John Snow Labs
name: mesh_umls_mapping
date: 2023-06-13
tags: [en, licensed, clinical, resolver, pipeline, chunk_mapping, mesh, umls]
task: Chunk Mapping
language: en
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the `mesh_umls_mapper` model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapping_en_4.4.4_3.2_1686663527159.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapping_en_4.4.4_3.2_1686663527159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("mesh_umls_mapping", "en", "clinical/models")
result = pipeline.fullAnnotate("C028491 D019326 C579867")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("mesh_umls_mapping", "en", "clinical/models")
val result = pipeline.fullAnnotate("C028491 D019326 C579867")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.mesh.umls.mapping").predict("""Put your text here.""")
```
## Results
```bash
Results
| | mesh_code | umls_code |
|---:|:----------------------------|:-------------------------------|
| 0 | C028491 | D019326 | C579867 | C0043904 | C0045010 | C3696376 |
```
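Conceptually, the chunk mapper is a lookup from one code system to another. Reading the two columns above as parallel, pipe-joined lists (an assumption about the output layout), the mapping can be sketched as a plain dictionary:

```python
# Hypothetical lookup built by pairing the mesh_code and umls_code lists
# positionally -- the actual pipeline resolves this with a ChunkMapperModel.
mesh_to_umls = {
    "C028491": "C0043904",
    "D019326": "C0045010",
    "C579867": "C3696376",
}
print([mesh_to_umls[c] for c in "C028491 D019326 C579867".split()])
# ['C0043904', 'C0045010', 'C3696376']
```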
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|mesh_umls_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|3.9 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- ChunkMapperModel
---
layout: model
title: English Bert Embeddings (from monsoon-nlp)
author: John Snow Labs
name: bert_embeddings_muril_adapted_local
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `muril-adapted-local` is an English model originally trained by `monsoon-nlp`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_en_3.4.2_3.0_1649672705449.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_en_3.4.2_3.0_1649672705449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.muril_adapted_local").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_muril_adapted_local|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|888.7 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/monsoon-nlp/muril-adapted-local
- https://tfhub.dev/google/MuRIL/1
---
layout: model
title: Arabic Bert Embeddings (from MutazYoune)
author: John Snow Labs
name: bert_embeddings_Ara_DialectBERT
date: 2022-04-11
tags: [bert, embeddings, ar, open_source]
task: Embeddings
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `Ara_DialectBERT` is an Arabic model originally trained by `MutazYoune`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_Ara_DialectBERT_ar_3.4.2_3.0_1649678666850.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_Ara_DialectBERT_ar_3.4.2_3.0_1649678666850.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_Ara_DialectBERT","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_Ara_DialectBERT","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("أنا أحب شرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.embed.Ara_DialectBERT").predict("""أنا أحب شرارة NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_Ara_DialectBERT|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|409.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/MutazYoune/Ara_DialectBERT
- https://github.com/elnagara/HARD-Arabic-Dataset
---
layout: model
title: Fast Neural Machine Translation Model from Kinyarwanda to English
author: John Snow Labs
name: opus_mt_rw_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, rw, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `rw`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_rw_en_xx_2.7.0_2.4_1609164993136.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_rw_en_xx_2.7.0_2.4_1609164993136.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_rw_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_rw_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.rw.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_rw_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Oncology Pipeline for Therapies
author: John Snow Labs
name: oncology_therapy_pipeline
date: 2022-12-01
tags: [licensed, pipeline, oncology, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline includes Named-Entity Recognition and Assertion Status models to extract information from oncology texts. This pipeline focuses on entities related to therapies.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/oncology_therapy_pipeline_en_4.2.2_3.0_1669906146446.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/oncology_therapy_pipeline_en_4.2.2_3.0_1669906146446.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("oncology_therapy_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("The patient underwent a mastectomy two years ago. She is currently receiving her second cycle of adriamycin and cyclophosphamide, and is in good overall condition.")[0]
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("oncology_therapy_pipeline", "en", "clinical/models")
val result = pipeline.fullAnnotate("""The patient underwent a mastectomy two years ago. She is currently receiving her second cycle of adriamycin and cyclophosphamide, and is in good overall condition.""")(0)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.oncology_therpay.pipeline").predict("""The patient underwent a mastectomy two years ago. She is currently receiving her second cycle of adriamycin and cyclophosphamide, and is in good overall condition.""")
```
## Results
```bash
******************** ner_oncology_wip results ********************
| chunk | ner_label |
|:-----------------|:---------------|
| mastectomy | Cancer_Surgery |
| second cycle | Cycle_Number |
| adriamycin | Chemotherapy |
| cyclophosphamide | Chemotherapy |
******************** ner_oncology_wip results ********************
| chunk | ner_label |
|:-----------------|:---------------|
| mastectomy | Cancer_Surgery |
| second cycle | Cycle_Number |
| adriamycin | Chemotherapy |
| cyclophosphamide | Chemotherapy |
******************** ner_oncology_wip results ********************
| chunk | ner_label |
|:-----------------|:---------------|
| mastectomy | Cancer_Surgery |
| second cycle | Cycle_Number |
| adriamycin | Cancer_Therapy |
| cyclophosphamide | Cancer_Therapy |
******************** ner_oncology_unspecific_posology_wip results ********************
| chunk | ner_label |
|:-----------------|:---------------------|
| mastectomy | Cancer_Therapy |
| second cycle | Posology_Information |
| adriamycin | Cancer_Therapy |
| cyclophosphamide | Cancer_Therapy |
******************** assertion_oncology_wip results ********************
| chunk | ner_label | assertion |
|:-----------------|:---------------|:------------|
| mastectomy | Cancer_Surgery | Past |
| adriamycin | Chemotherapy | Present |
| cyclophosphamide | Chemotherapy | Present |
******************** assertion_oncology_treatment_binary_wip results ********************
| chunk | ner_label | assertion |
|:-----------------|:---------------|:----------------|
| mastectomy | Cancer_Surgery | Present_Or_Past |
| adriamycin | Chemotherapy | Present_Or_Past |
| cyclophosphamide | Chemotherapy | Present_Or_Past |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|oncology_therapy_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- ChunkMergeModel
- ChunkMergeModel
- AssertionDLModel
- AssertionDLModel
---
layout: model
title: ESG Text Classification (Augmented, 26 classes)
author: John Snow Labs
name: finclf_augmented_esg
date: 2022-09-06
tags: [en, financial, esg, classification, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
recommended: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model classifies financial texts / news into 26 ESG classes, which belong to three verticals: Environment, Social, and Governance. This model can be used to build an ESG scoreboard for companies.
If you are looking for the generic version, which only returns Environment, Social, or Governance, please see the `finance_sequence_classifier_esg` model in the Models Hub.
## Predicted Entities
`Business_Ethics`, `Data_Security`, `Access_And_Affordability`, `Business_Model_Resilience`, `Competitive_Behavior`, `Critical_Incident_Risk_Management`, `Customer_Welfare`, `Director_Removal`, `Employee_Engagement_Inclusion_And_Diversity`, `Employee_Health_And_Safety`, `Human_Rights_And_Community_Relations`, `Labor_Practices`, `Management_Of_Legal_And_Regulatory_Framework`, `Physical_Impacts_Of_Climate_Change`, `Product_Quality_And_Safety`, `Product_Design_And_Lifecycle_Management`, `Selling_Practices_And_Product_Labeling`, `Supply_Chain_Management`, `Systemic_Risk_Management`, `Waste_And_Hazardous_Materials_Management`, `Water_And_Wastewater_Management`, `Air_Quality`, `Customer_Privacy`, `Ecological_Impacts`, `Energy_Management`, `GHG_Emissions`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/FINCLF_ESG/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_augmented_esg_en_1.0.0_3.2_1662473372920.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_augmented_esg_en_1.0.0_3.2_1662473372920.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = nlp.Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
sequenceClassifier = finance.BertForSequenceClassification.pretrained("finclf_augmented_esg", "en", "finance/models")\
.setInputCols(["document",'token'])\
.setOutputCol("class")
pipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
# couple of simple examples
example = spark.createDataFrame([["""The Canadian Environmental Assessment Agency (CEAA) concluded that in June 2016 the company had not made an effort
to protect public drinking water and was ignoring concerns raised by its own scientists about the potential levels of pollutants in the local water supply.
At the time, there were concerns that the company was not fully testing onsite wells for contaminants and did not use the proper methods for testing because
of its test kits now manufactured in China. A preliminary report by the company in June 2016 was commissioned by the Alberta government to provide recommendations
to Alberta Environment officials"""]]).toDF("text")
result = pipeline.fit(example).transform(example)
# result is a DataFrame
result.select("text", "class.result").show()
```
## Results
```bash
+--------------------+--------------------+
| text| result|
+--------------------+--------------------+
|The Canadian Envi...|[Waste_And_Hazard...|
+--------------------+--------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finclf_augmented_esg|
|Type:|finance|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|410.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
In-house annotations from scraped annual reports and tweets about ESG
## Benchmarking
```bash
label precision recall f1-score support
Business_Ethics 0.73 0.80 0.76 10
Data_Security 1.00 0.89 0.94 9
Access_And_Affordability 1.00 1.00 1.00 15
Business_Model_Resilience 1.00 1.00 1.00 12
Competitive_Behavior 0.92 1.00 0.96 12
Critical_Incident_Risk_Management 0.92 1.00 0.96 11
Customer_Welfare 0.85 1.00 0.92 11
Director_Removal 0.91 1.00 0.95 10
Employee_Engagement_Inclusion_And_Diversity 1.00 1.00 1.00 11
Employee_Health_And_Safety 1.00 1.00 1.00 10
Human_Rights_And_Community_Relations 0.94 1.00 0.97 16
Labor_Practices 0.71 0.53 0.61 19
Management_Of_Legal_And_Regulatory_Framework 1.00 0.95 0.97 19
Physical_Impacts_Of_Climate_Change 0.93 1.00 0.97 14
Product_Quality_And_Safety 1.00 1.00 1.00 14
Product_Design_And_Lifecycle_Management 1.00 1.00 1.00 18
Selling_Practices_And_Product_Labeling 1.00 1.00 1.00 17
Supply_Chain_Management 0.89 1.00 0.94 8
Systemic_Risk_Management 1.00 0.86 0.92 14
Waste_And_Hazardous_Materials_Management 0.88 1.00 0.93 14
Water_And_Wastewater_Management 1.00 1.00 1.00 8
Air_Quality 1.00 1.00 1.00 16
Customer_Privacy 1.00 0.93 0.97 15
Ecological_Impacts 1.00 1.00 1.00 16
Energy_Management 1.00 0.91 0.95 11
GHG_Emissions 1.00 0.91 0.95 11
accuracy - - 0.95 330
macro-avg 0.95 0.95 0.95 330
weighted-avg 0.95 0.95 0.95 330
```
---
layout: model
title: English ElectraForQuestionAnswering Small model (from Palak)
author: John Snow Labs
name: electra_qa_google_small_discriminator_squad
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `google_electra-small-discriminator_squad` is an English model originally trained by `Palak`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_google_small_discriminator_squad_en_4.0.0_3.0_1655922022343.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_google_small_discriminator_squad_en_4.0.0_3.0_1655922022343.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_google_small_discriminator_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_google_small_discriminator_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.electra.small.by_Palak").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_google_small_discriminator_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|51.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Palak/google_electra-small-discriminator_squad
---
layout: model
title: English DistilBertForTokenClassification Cased model (from Lucifermorningstar011)
author: John Snow Labs
name: distilbert_token_classifier_autotrain_final_784824218
date: 2023-03-14
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-final-784824218` is an English model originally trained by `Lucifermorningstar011`.
## Predicted Entities
`9`, `0`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824218_en_4.3.1_3.0_1678783236100.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824218_en_4.3.1_3.0_1678783236100.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824218","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824218","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_autotrain_final_784824218|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/Lucifermorningstar011/autotrain-final-784824218
---
layout: model
title: English RobertaForQuestionAnswering (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_fpdm_triplet_roberta_FT_new_newsqa
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_triplet_roberta_FT_new_newsqa` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_triplet_roberta_FT_new_newsqa_en_4.0.0_3.0_1655728767027.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_triplet_roberta_FT_new_newsqa_en_4.0.0_3.0_1655728767027.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_triplet_roberta_FT_new_newsqa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_fpdm_triplet_roberta_FT_new_newsqa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.news.roberta.qa_fpdm_triplet_roberta_ft_new_newsqa.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_fpdm_triplet_roberta_FT_new_newsqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|461.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/fpdm_triplet_roberta_FT_new_newsqa
---
layout: model
title: German Bert Embeddings (from amine)
author: John Snow Labs
name: bert_embeddings_bert_base_5lang_cased
date: 2022-04-11
tags: [bert, embeddings, de, open_source]
task: Embeddings
language: de
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-5lang-cased` is a German model originally trained by `amine`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_5lang_cased_de_3.4.2_3.0_1649676183514.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_5lang_cased_de_3.4.2_3.0_1649676183514.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_5lang_cased","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_5lang_cased","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ich liebe Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.embed.bert_base_5lang_cased").predict("""Ich liebe Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_5lang_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|de|
|Size:|464.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/amine/bert-base-5lang-cased
- https://cloud.google.com/compute/docs/machine-types#n1_machine_type
---
layout: model
title: Translate English to Catalan Pipeline
author: John Snow Labs
name: translate_en_ca
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, ca, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `ca`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ca_xx_2.7.0_2.4_1609686877248.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ca_xx_2.7.0_2.4_1609686877248.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_ca", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_ca", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.ca').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_ca|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Indonesian XLMRobertaForTokenClassification Cased model (from vkhangpham)
author: John Snow Labs
name: xlmroberta_ner_shopee
date: 2022-08-13
tags: [id, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: id
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `shopee-ner` is an Indonesian model originally trained by `vkhangpham`.
## Predicted Entities
`STR`, `POI`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_shopee_id_4.1.0_3.0_1660423012861.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_shopee_id_4.1.0_3.0_1660423012861.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_shopee","id") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_shopee","id")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_shopee|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|id|
|Size:|865.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/vkhangpham/shopee-ner
---
layout: model
title: Legal Certain definitions Clause Binary Classifier
author: John Snow Labs
name: legclf_certain_definitions_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `certain-definitions` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
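The first technique, paragraph splitting by multiline, can be sketched in plain Python. This is a minimal illustration only; `split_paragraphs` is a hypothetical helper, not part of Spark NLP (see the workshop notebook above for the full approach):

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Split a document into paragraphs on runs of blank lines (multiline splitting)."""
    # A paragraph boundary is one or more newlines separated only by whitespace.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = 'Section 1. Definitions.\n\n"Agreement" means this contract.\n\nSection 2. Term.'
for paragraph in split_paragraphs(doc):
    print(paragraph)
```

Each resulting paragraph can then be fed to the classifier as an independent document.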
Take into consideration that the embeddings of this model allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `certain-definitions`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_certain_definitions_clause_en_1.0.0_3.2_1660122218038.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_certain_definitions_clause_en_1.0.0_3.2_1660122218038.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
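No snippet ships with this card; the sketch below shows what a typical Legal NLP clause-classification pipeline for it could look like, assuming a licensed Legal NLP session (`nlp`, `legal`, `spark`) is already started. The sentence-embeddings stage shown (`sent_bert_base_cased`) is an assumption for illustration; pair this classifier with the embeddings it was actually trained on.

```python
# Sketch only: assumes the johnsnowlabs library with a licensed Legal NLP
# session started, exposing `nlp`, `legal`, and a running `spark`.
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumption: a generic BERT sentence-embeddings stage; check the training
# setup for the exact embeddings this classifier expects.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_certain_definitions_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["PUT YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```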
## Results
```bash
+---------------------+
|result               |
+---------------------+
|[certain-definitions]|
|[other]              |
|[other]              |
|[certain-definitions]|
+---------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_certain_definitions_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
certain-definitions 0.95 0.84 0.89 49
other 0.94 0.99 0.96 138
accuracy - - 0.95 187
macro-avg 0.95 0.91 0.93 187
weighted-avg 0.95 0.95 0.95 187
```
---
layout: model
title: Sinhala BertForQuestionAnswering model (from sankhajay)
author: John Snow Labs
name: bert_qa_bert_base_sinhala_qa
date: 2022-06-02
tags: [si, open_source, question_answering, bert]
task: Question Answering
language: si
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-sinhala-qa` is a Sinhala model originally trained by `sankhajay`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_sinhala_qa_si_4.0.0_3.0_1654180367412.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_sinhala_qa_si_4.0.0_3.0_1654180367412.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_sinhala_qa","si") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_sinhala_qa","si")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("si.answer_question.bert.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_sinhala_qa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|si|
|Size:|752.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/sankhajay/bert-base-sinhala-qa
---
layout: model
title: Legal Forfeitures Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_forfeitures_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, forfeitures, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Forfeitures` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Forfeitures`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_forfeitures_bert_en_1.0.0_3.0_1678046913501.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_forfeitures_bert_en_1.0.0_3.0_1678046913501.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
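This card omits a usage snippet; a plausible pipeline, sketched under the same assumptions as the other `legclf_*` cards (a licensed Legal NLP session with `nlp`, `legal`, and `spark` in scope; `sent_bert_base_cased` stands in for whatever embeddings the classifier was trained with), would be:

```python
# Sketch only: assumes a licensed Legal NLP session is already started.
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumption: a generic BERT sentence-embeddings stage for illustration.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_forfeitures_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["PUT YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```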
## Results
```bash
+-------------+
|result       |
+-------------+
|[Forfeitures]|
|[Other]      |
|[Other]      |
|[Forfeitures]|
+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_forfeitures_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Forfeitures 0.91 0.97 0.94 32
Other 0.98 0.94 0.96 50
accuracy - - 0.95 82
macro-avg 0.95 0.95 0.95 82
weighted-avg 0.95 0.95 0.95 82
```
---
layout: model
title: English asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2 TFWav2Vec2ForCTC from gary109
author: John Snow Labs
name: pipeline_asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2` is an English model originally trained by gary109.
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2_en_4.2.0_3.0_1664101430715.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2_en_4.2.0_3.0_1664101430715.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English asr_wav2vec2_xls_r_300m_hindi_lm TFWav2Vec2ForCTC from shoubhik
author: John Snow Labs
name: asr_wav2vec2_xls_r_300m_hindi_lm
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_hindi_lm` is an English model originally trained by shoubhik.
NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_xls_r_300m_hindi_lm_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_hindi_lm_en_4.2.0_3.0_1664106060093.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_hindi_lm_en_4.2.0_3.0_1664106060093.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_xls_r_300m_hindi_lm", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_xls_r_300m_hindi_lm", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_xls_r_300m_hindi_lm|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from deepakvk)
author: John Snow Labs
name: roberta_qa_deepakvk_base_squad2_finetuned_squad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-squad` is an English model originally trained by `deepakvk`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepakvk_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219250943.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepakvk_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219250943.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepakvk_base_squad2_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepakvk_base_squad2_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_deepakvk_base_squad2_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/deepakvk/roberta-base-squad2-finetuned-squad
---
layout: model
title: English T5ForConditionalGeneration Cased model (from cometrain)
author: John Snow Labs
name: t5_fake_news_detector
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fake-news-detector-t5` is an English model originally trained by `cometrain`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_fake_news_detector_en_4.3.0_3.0_1675101857981.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_fake_news_detector_en_4.3.0_3.0_1675101857981.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_fake_news_detector","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_fake_news_detector","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_fake_news_detector|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|277.3 MB|
## References
- https://huggingface.co/cometrain/fake-news-detector-t5
- https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from khoanvm)
author: John Snow Labs
name: roberta_qa_base_squad2_finetuned_visquad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-visquad` is an English model originally trained by `khoanvm`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_finetuned_visquad_en_4.3.0_3.0_1674219613502.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_finetuned_visquad_en_4.3.0_3.0_1674219613502.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_finetuned_visquad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_finetuned_visquad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_squad2_finetuned_visquad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/khoanvm/roberta-base-squad2-finetuned-visquad
---
layout: model
title: English DistilBertForQuestionAnswering model (from anurag0077) Squad2
author: John Snow Labs
name: distilbert_qa_anurag0077_base_uncased_finetuned_squad2
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad2` is an English model originally trained by `anurag0077`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_anurag0077_base_uncased_finetuned_squad2_en_4.0.0_3.0_1654726811228.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_anurag0077_base_uncased_finetuned_squad2_en_4.0.0_3.0_1654726811228.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anurag0077_base_uncased_finetuned_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anurag0077_base_uncased_finetuned_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.distil_bert.base_uncased.by_anurag0077").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_anurag0077_base_uncased_finetuned_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anurag0077/distilbert-base-uncased-finetuned-squad2
---
layout: model
title: English RoBERTa Embeddings (SMILES Strings, v2)
author: John Snow Labs
name: roberta_embeddings_chEMBL26_smiles_v2
date: 2022-04-14
tags: [roberta, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chEMBL26_smiles_v2` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_chEMBL26_smiles_v2_en_3.4.2_3.0_1649946865988.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_chEMBL26_smiles_v2_en_3.4.2_3.0_1649946865988.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_chEMBL26_smiles_v2","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_chEMBL26_smiles_v2","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.chEMBL26_smiles_v2").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_chEMBL26_smiles_v2|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|90.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/mrm8488/chEMBL26_smiles_v2
---
layout: model
title: Legal Costs Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_costs_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, costs, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Costs` clause type. To use it, make sure you provide enough context as input. Adding sentence splitters to the pipeline makes the model see only sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at sentence level.
If you have big legal documents, and you want to look for clauses, we recommend you to split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
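As a minimal illustration of the first technique above (paragraph splitting by multiline), assuming plain-text input, provisions can be separated with the standard library before being fed to the classifier; the helper name `split_paragraphs` is illustrative, not part of any Spark NLP API:

```python
import re

def split_paragraphs(text: str) -> list:
    # Split on runs of blank lines and drop empty fragments,
    # so each provision/paragraph becomes one classifier input.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "First provision about costs.\n\nSecond provision about termination.\n"
print(split_paragraphs(doc))
# → ['First provision about costs.', 'Second provision about termination.']
```

For production splitting (headers, subheaders, pages), prefer the techniques in the tutorial linked above.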
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you add.
## Predicted Entities
`Costs`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_costs_bert_en_1.0.0_3.0_1678049894274.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_costs_bert_en_1.0.0_3.0_1678049894274.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
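No code snippet was published with this card; the following is a minimal sketch that follows the pattern of other Legal NLP classifier cards. It requires the licensed `johnsnowlabs` library, and the `sent_bert_base_cased` sentence-embeddings model is an assumption — substitute the embeddings model recommended for this classifier.

```python
# Sketch only: assumes a licensed johnsnowlabs (Legal NLP) installation
# and an active Spark session (`spark`).
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumption: generic cased BERT sentence embeddings; verify the
# recommended embeddings for this model before production use.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_costs_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```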
## Results
```bash
+-------+
|result |
+-------+
|[Costs]|
|[Other]|
|[Other]|
|[Costs]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_costs_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|
## References
Training dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Costs 0.81 1.00 0.90 13
Other 1.00 0.88 0.94 26
accuracy - - 0.92 39
macro-avg 0.91 0.94 0.92 39
weighted-avg 0.94 0.92 0.92 39
```
---
layout: model
title: Fast Neural Machine Translation Model from English to Tsonga
author: John Snow Labs
name: opus_mt_en_ts
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, ts, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `en`
- target languages: `ts`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ts_xx_2.7.0_2.4_1609170044688.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ts_xx_2.7.0_2.4_1609170044688.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_ts", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_ts", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ts').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ts|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: NER Model Finder with Sentence Entity Resolvers (sbert_jsl_medium_uncased)
author: John Snow Labs
name: sbertresolve_ner_model_finder
date: 2022-09-05
tags: [en, entity_resolver, licensed, ner, clinical]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 4.1.0
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities (NER labels) to the most appropriate NER model using `sbert_jsl_medium_uncased` Sentence Bert Embeddings. Given the entity name, it will return a list of pretrained NER models having that entity or similar ones.
## Predicted Entities
`ner_model_list`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_ner_model_finder_en_4.1.0_3.0_1662377743401.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_ner_model_finder_en_4.1.0_3.0_1662377743401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
sbert_embedder = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased","en","clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("sbert_embeddings")
ner_model_finder = SentenceEntityResolverModel.pretrained("sbertresolve_ner_model_finder", "en", "clinical/models")\
.setInputCols(["sbert_embeddings"])\
.setOutputCol("model_names")\
.setDistanceFunction("EUCLIDEAN")
ner_model_finder_pipelineModel = PipelineModel(stages = [documentAssembler, sbert_embedder, ner_model_finder])
light_pipeline = LightPipeline(ner_model_finder_pipelineModel)
annotations = light_pipeline.fullAnnotate("medication")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased","en","clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("sbert_embeddings")
val ner_model_finder = SentenceEntityResolverModel.pretrained("sbertresolve_ner_model_finder", "en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("model_names")
.setDistanceFunction("EUCLIDEAN")
val ner_model_finder_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, ner_model_finder))
val ner_model_finder_pipelineModel = ner_model_finder_pipeline.fit(Seq("").toDF("text"))
val light_pipeline = new LightPipeline(ner_model_finder_pipelineModel)
val annotations = light_pipeline.fullAnnotate("medication")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.ner.model_finder").predict("""Put your text here.""")
```
## Results
```bash
entity: medication
models: ['ner_posology_greedy', 'jsl_ner_wip_modifier_clinical', 'ner_posology_small', 'ner_jsl_greedy', 'ner_ade_clinical', 'ner_posology', 'ner_risk_factors', 'ner_ade_healthcare', 'ner_drugs_large', 'ner_jsl_slim', 'ner_posology_experimental', 'ner_posology_large', 'ner_posology_healthcare', 'ner_drugs_greedy', 'ner_pathogen']
all_models: ['ner_posology_greedy', 'jsl_ner_wip_modifier_clinical', 'ner_posology_small', 'ner_jsl_greedy', 'ner_ade_clinical', 'ner_posology', 'ner_risk_factors', 'ner_ade_healthcare', 'ner_drugs_large', 'ner_jsl_slim', 'ner_posology_experimental', 'ner_posology_large', 'ner_posology_healthcare', 'ner_drugs_greedy', 'ner_pathogen']:::['ner_posology_greedy', 'jsl_ner_wip_modifier_clinical', 'ner_posology_small', 'ner_jsl_greedy', 'ner_ade_clinical', 'ner_nature_nero_clinical', 'ner_posology', 'ner_biomarker', 'ner_clinical_trials_abstracts', 'ner_risk_factors', 'ner_ade_healthcare', 'ner_drugs_large', 'ner_jsl_slim', 'ner_posology_experimental', 'ner_posology_large', 'ner_posology_healthcare', 'ner_drugs_greedy']:::['ner_covid_trials', 'ner_jsl', 'jsl_rd_ner_wip_greedy_clinical', 'jsl_ner_wip_modifier_clinical', 'ner_healthcare', 'ner_jsl_enriched', 'ner_events_clinical', 'ner_jsl_greedy', 'ner_clinical', 'ner_clinical_large', 'ner_jsl_slim', 'ner_events_healthcare', 'ner_events_admission_clinical']:::['ner_biomarker']:::['ner_medmentions_coarse']:::['ner_covid_trials', 'ner_jsl_enriched', 'ner_jsl', 'ner_medmentions_coarse']:::['ner_drugs']:::['ner_nature_nero_clinical']:::['ner_jsl', 'jsl_rd_ner_wip_greedy_clinical', 'jsl_ner_wip_modifier_clinical', 'ner_medmentions_coarse', 'ner_jsl_enriched', 'ner_jsl_greedy']:::['ner_jsl', 'jsl_rd_ner_wip_greedy_clinical', 'ner_nature_nero_clinical', 'ner_medmentions_coarse', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_enriched', 'ner_radiology_wip_clinical', 'ner_jsl_greedy', 'ner_radiology', 'ner_jsl_slim']:::['ner_posology_experimental']:::['ner_pathogen']:::['ner_measurements_clinical', 'jsl_rd_ner_wip_greedy_clinical', 'ner_nature_nero_clinical', 'ner_radiology_wip_clinical', 'ner_radiology', 'ner_nihss']:::['ner_jsl', 'ner_posology_greedy', 'jsl_rd_ner_wip_greedy_clinical', 'jsl_ner_wip_modifier_clinical', 'ner_posology_small', 'ner_jsl_enriched', 'ner_jsl_greedy', 'ner_posology', 'ner_posology_experimental', 'ner_posology_large', 'ner_posology_healthcare']:::['ner_covid_trials', 'ner_jsl', 'jsl_rd_ner_wip_greedy_clinical', 'ner_medmentions_coarse', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_enriched', 'ner_jsl_greedy']:::['ner_clinical_trials_abstracts']:::['ner_medmentions_coarse', 'ner_nature_nero_clinical']
resolutions: medication:::drug:::treatment:::targeted therapy:::therapeutic procedure:::drug ingredient:::drug chemical:::medical procedure:::substance:::medical device:::administration:::medical condition:::measurement:::drug strength:::physiological reaction:::dose:::research activity
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbertresolve_ner_model_finder|
|Compatibility:|Healthcare NLP 4.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sbert_embeddings]|
|Output Labels:|[models]|
|Language:|en|
|Size:|737.3 KB|
|Case sensitive:|false|
## References
This model is trained on data containing the labels of 70 different clinical NER models.
---
layout: model
title: Norwegian BertForMaskedLM Cased model (from ltgoslo)
author: John Snow Labs
name: bert_embeddings_norbert2
date: 2022-12-02
tags: ["no", open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: "no"
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `norbert2` is a Norwegian model originally trained by `ltgoslo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_norbert2_no_4.2.4_3.0_1670022783195.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_norbert2_no_4.2.4_3.0_1670022783195.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_norbert2","no") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_norbert2","no")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_norbert2|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|no|
|Size:|467.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/ltgoslo/norbert2
- http://vectors.nlpl.eu/repository/20/221.zip
- http://norlm.nlpl.eu/
- https://github.com/ltgoslo/NorBERT
- https://aclanthology.org/2021.nodalida-main.4/
- https://www.eosc-nordic.eu/
- https://www.mn.uio.no/ifi/english/research/groups/ltg/
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_dm2000
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-dm2000` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dm2000_en_4.3.0_3.0_1675119184723.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dm2000_en_4.3.0_3.0_1675119184723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_small_dm2000","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_small_dm2000","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_dm2000|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|590.0 MB|
## References
- https://huggingface.co/google/t5-efficient-small-dm2000
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English asr_wav2vec2_base_timit_moaiz_exp2 TFWav2Vec2ForCTC from moaiz237
author: John Snow Labs
name: asr_wav2vec2_base_timit_moaiz_exp2
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_moaiz_exp2` is an English model originally trained by moaiz237.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_wav2vec2_base_timit_moaiz_exp2_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_moaiz_exp2_en_4.2.0_3.0_1664037589335.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_moaiz_exp2_en_4.2.0_3.0_1664037589335.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_timit_moaiz_exp2", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_timit_moaiz_exp2", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_moaiz_exp2|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from nlpconnect)
author: John Snow Labs
name: roberta_qa_dpr_nq_reader_base_v2
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dpr-nq-reader-roberta-base-v2` is an English model originally trained by `nlpconnect`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_dpr_nq_reader_base_v2_en_4.3.0_3.0_1674210757247.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_dpr_nq_reader_base_v2_en_4.3.0_3.0_1674210757247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_dpr_nq_reader_base_v2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_dpr_nq_reader_base_v2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_dpr_nq_reader_base_v2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|466.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/nlpconnect/dpr-nq-reader-roberta-base-v2
---
layout: model
title: Detect concepts in drug development trials (BertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_drug_development_trials
date: 2021-12-17
tags: [en, ner, clinical, licensed, bertfortokenclassification]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.3.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a `BertForTokenClassification` NER model that identifies concepts related to drug development, including `Trial Groups`, `End Points`, `Hazard Ratio`, and other entities, in free text.
## Predicted Entities
`Patient_Count`, `Duration`, `End_Point`, `Value`, `Trial_Group`, `Hazard_Ratio`, `Total_Patients`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DRUGS_DEVELOPMENT_TRIALS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_drug_development_trials_en_3.3.2_3.0_1639776838533.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_drug_development_trials_en_3.3.2_3.0_1639776838533.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models")\
.setInputCols("token", "document")\
.setOutputCol("ner")\
.setCaseSensitive(True)
ner_converter = NerConverter()\
.setInputCols(["document","token","ner"])\
.setOutputCol("ner_chunk")
p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))
test_sentence = """In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan."""
result = p_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]})))
```
```scala
...
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models")
.setInputCols("token", "document")
.setOutputCol("ner")
.setCaseSensitive(true)
val ner_converter = new NerConverter()
.setInputCols(Array("document","token","ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter))
val data = Seq("""In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.ner.drug_development_trials").predict("""In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.""")
```
## Results
```bash
| | chunk | entity |
|---:|:------------------|:--------------|
| 0 | median | Duration |
| 1 | overall survival | End_Point |
| 2 | with | Trial_Group |
| 3 | without topotecan | Trial_Group |
| 4 | 4.0 | Value |
| 5 | 3.6 months | Value |
| 6 | 23 | Patient_Count |
| 7 | 63 | Patient_Count |
| 8 | 55 | Patient_Count |
| 9 | 33 patients | Patient_Count |
| 10 | topotecan | Trial_Group |
| 11 | 11 | Patient_Count |
| 12 | 61 | Patient_Count |
| 13 | 66 | Patient_Count |
| 14 | 32 patients | Patient_Count |
| 15 | without topotecan | Trial_Group |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_drug_development_trials|
|Compatibility:|Healthcare NLP 3.3.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|400.6 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## Data Source
Trained on data obtained from `clinicaltrials.gov` and annotated in-house.
## Benchmarking
```bash
label precision recall f1 support
B-Duration 0.93 0.94 0.93 1820
B-End_Point 0.99 0.98 0.98 5022
B-Hazard_Ratio 0.97 0.95 0.96 778
B-Patient_Count 0.81 0.88 0.85 300
B-Trial_Group 0.86 0.88 0.87 6751
B-Value 0.94 0.96 0.95 7675
I-Duration 0.71 0.82 0.76 185
I-End_Point 0.94 0.98 0.96 1491
I-Patient_Count 0.48 0.64 0.55 44
I-Trial_Group 0.78 0.75 0.77 4561
I-Value 0.93 0.95 0.94 1511
O 0.96 0.95 0.95 47423
accuracy - - 0.94 77608
macro-avg 0.79 0.82 0.80 77608
weighted-avg 0.94 0.94 0.94 77608
```
---
layout: model
title: English RobertaForQuestionAnswering Tiny Cased model (from deepset)
author: John Snow Labs
name: roberta_qa_tiny_6l_768d
date: 2022-12-02
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tinyroberta-6l-768d` is an English model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_tiny_6l_768d_en_4.2.4_3.0_1669988517909.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_tiny_6l_768d_en_4.2.4_3.0_1669988517909.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tiny_6l_768d","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tiny_6l_768d","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_tiny_6l_768d|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|307.2 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/deepset/tinyroberta-6l-768d
- https://arxiv.org/pdf/1909.10351.pdf
- https://github.com/deepset-ai/haystack
- https://haystack.deepset.ai/guides/model-distillation
- https://github.com/deepset-ai/haystack/
- https://deepset.ai/german-bert
- https://deepset.ai/germanquad
- https://github.com/deepset-ai/FARM
- https://twitter.com/deepset_ai
- https://www.linkedin.com/company/deepset-ai/
- https://haystack.deepset.ai/community/join
- https://github.com/deepset-ai/haystack/discussions
- https://deepset.ai
- http://www.deepset.ai/jobs
---
layout: model
title: Legal Joint Filing Agreement Document Classifier (Longformer)
author: John Snow Labs
name: legclf_joint_filing_agreement
date: 2022-11-24
tags: [en, legal, classification, agreement, joint_filing, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_joint_filing_agreement` model is a Legal Longformer Document Classifier that determines whether a document belongs to the class `joint-filing-agreement` or not (binary classification).
Longformers are limited to 4096 tokens, so only the first 4096 tokens of a document are taken into account. We have found that for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document without any extra leading material, 4096 tokens are enough to perform document classification.
If that is not the case for your documents, let us know and we can apply an alternative approach: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, so that the whole document is taken into account. In practice, this should rarely be required.
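The chunk-and-average fallback described above can be illustrated with a short NumPy sketch. This is not part of the Spark NLP API: the token list, the `embed_fn` stand-in, and the 4096-token window are illustrative assumptions.

```python
import numpy as np

MAX_LEN = 4096  # Longformer's token limit


def average_chunk_embeddings(tokens, embed_fn, max_len=MAX_LEN):
    """Split a token sequence into max_len-sized chunks, embed each chunk,
    and average the chunk vectors so the whole document contributes."""
    chunks = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
    chunk_vectors = np.stack([embed_fn(chunk) for chunk in chunks])
    return chunk_vectors.mean(axis=0)


# Toy embedding: a 4-dim vector derived from chunk length
# (a stand-in for a real sentence encoder).
toy_embed = lambda chunk: np.full(4, float(len(chunk)))

doc = list(range(10000))  # a "document" of 10,000 tokens
vec = average_chunk_embeddings(doc, toy_embed)
# chunks of 4096, 4096, and 1808 tokens -> each component equals 10000/3
print(vec)
```

Training then proceeds on the averaged vector instead of the truncated first-4096-token embedding.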
## Predicted Entities
`joint-filing-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_joint_filing_agreement_en_1.0.0_3.0_1669291473829.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_joint_filing_agreement_en_1.0.0_3.0_1669291473829.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+------------------------+
|result                  |
+------------------------+
|[joint-filing-agreement]|
|[other]                 |
|[other]                 |
|[joint-filing-agreement]|
+------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_joint_filing_agreement|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.8 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
joint-filing-agreement 0.97 0.97 0.97 31
other 0.99 0.99 0.99 90
accuracy - - 0.98 121
macro avg 0.98 0.98 0.98 121
weighted avg 0.98 0.98 0.98 121
```
---
layout: model
title: Translate Lozi to English Pipeline
author: John Snow Labs
name: translate_loz_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, loz, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `loz`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_loz_en_xx_2.7.0_2.4_1609698759083.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_loz_en_xx_2.7.0_2.4_1609698759083.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_loz_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_loz_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.loz.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_loz_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Fast Neural Machine Translation Model from English to Dutch
author: John Snow Labs
name: opus_mt_en_nl
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, nl, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects (see below for an incomplete list).
- source languages: `en`
- target languages: `nl`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_nl_xx_2.7.0_2.4_1609164726700.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_nl_xx_2.7.0_2.4_1609164726700.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_nl", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_nl", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.nl').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_nl|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Clinical English Bert Embeddings (Base, 512 dimension)
author: John Snow Labs
name: bert_embeddings_clinical_pubmed_bert_base_512
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `clinical-pubmed-bert-base-512` is an English model originally trained by `Tsubasaz`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_clinical_pubmed_bert_base_512_en_3.4.2_3.0_1649672313480.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_clinical_pubmed_bert_base_512_en_3.4.2_3.0_1649672313480.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_clinical_pubmed_bert_base_512","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_clinical_pubmed_bert_base_512","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.clinical_pubmed_bert_base_512").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_clinical_pubmed_bert_base_512|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|410.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Tsubasaz/clinical-pubmed-bert-base-512
- https://mimic.physionet.org/
---
layout: model
title: English RoBERTa Embeddings (Base, Wikipedia and Bookcorpus datasets)
author: John Snow Labs
name: roberta_embeddings_muppet_roberta_base
date: 2022-04-14
tags: [roberta, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `muppet-roberta-base` is an English model originally trained by `facebook`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_muppet_roberta_base_en_3.4.2_3.0_1649946369947.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_muppet_roberta_base_en_3.4.2_3.0_1649946369947.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_muppet_roberta_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_muppet_roberta_base","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.muppet_roberta_base").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_muppet_roberta_base|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|301.3 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/facebook/muppet-roberta-base
- https://arxiv.org/abs/2101.11038
---
layout: model
title: Fast Neural Machine Translation Model from Kabyle to English
author: John Snow Labs
name: opus_mt_kab_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, kab, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `kab`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_kab_en_xx_2.7.0_2.4_1609166904449.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_kab_en_xx_2.7.0_2.4_1609166904449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_kab_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_kab_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.kab.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_kab_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Moldavian, Moldovan, Romanian asr_wav2vec2_large_xlsr_53_romanian_by_anton_l TFWav2Vec2ForCTC from anton-l
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_anton_l
date: 2022-09-25
tags: [wav2vec2, ro, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: ro
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_romanian_by_anton_l` is a Moldavian, Moldovan, Romanian model originally trained by anton-l.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_anton_l_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_anton_l_ro_4.2.0_3.0_1664098754607.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_anton_l_ro_4.2.0_3.0_1664098754607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_anton_l', lang = 'ro')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_anton_l", lang = "ro")
val annotations = pipeline.transform(audioDF)
```
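The `audioDF` used above needs a column of audio samples as floats (typically 16 kHz mono for Wav2Vec2 models). As a self-contained sketch, the helper below decodes 16-bit PCM WAV bytes into floats in `[-1.0, 1.0]` using only the standard library, and generates a tiny in-memory sine-wave WAV to exercise it:

```python
import io
import math
import struct
import wave

def wav_to_floats(wav_bytes):
    # Decode 16-bit PCM mono WAV bytes into a list of floats in [-1.0, 1.0],
    # the shape of input ASR pipelines typically expect.
    with wave.open(io.BytesIO(wav_bytes)) as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a 0.1 s, 16 kHz sine-wave WAV in memory so the sketch is runnable.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(b"".join(
        struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * 440 * t / 16000)))
        for t in range(1600)))

floats = wav_to_floats(buf.getvalue())
print(len(floats))  # 1600 samples = 0.1 s of audio
```

You would then load such float arrays into a Spark DataFrame column before passing it to the pretrained pipeline.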
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_anton_l|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|ro|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_nh1
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nh1` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nh1_en_4.3.0_3.0_1675123623466.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nh1_en_4.3.0_3.0_1675123623466.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_tiny_nh1","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_tiny_nh1","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_nh1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|41.6 MB|
## References
- https://huggingface.co/google/t5-efficient-tiny-nh1
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English RobertaForQuestionAnswering (from thatdramebaazguy)
author: John Snow Labs
name: roberta_qa_roberta_base_MITmovie_squad
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-MITmovie-squad` is an English model originally trained by `thatdramebaazguy`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_MITmovie_squad_en_4.0.0_3.0_1655729784944.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_MITmovie_squad_en_4.0.0_3.0_1655729784944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_MITmovie_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_MITmovie_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.movie_squad.roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
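In the nlu one-liner above, a single string carries both parts of the input, with `|||` separating the question from the context. A tiny helper to build or sanity-check that format (a sketch; the separator is as shown in the snippet above):

```python
def split_qa_input(text, sep="|||"):
    # Split an nlu-style QA input "question|||context" into its two parts.
    question, found, context = text.partition(sep)
    if not found:
        raise ValueError("expected 'question%scontext'" % sep)
    return question.strip(), context.strip()

q, c = split_qa_input("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```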
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_MITmovie_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|461.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/thatdramebaazguy/roberta-base-MITmovie-squad
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from sunitha)
author: John Snow Labs
name: roberta_qa_cv_custom_ds
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `CV_Custom_DS` is an English model originally trained by `sunitha`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_cv_custom_ds_en_4.3.0_3.0_1674207905368.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_cv_custom_ds_en_4.3.0_3.0_1674207905368.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_cv_custom_ds","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_cv_custom_ds","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_cv_custom_ds|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/sunitha/CV_Custom_DS
---
layout: model
title: Pipeline to Extraction of Clinical Abbreviations and Acronyms
author: John Snow Labs
name: ner_abbreviation_clinical_pipeline
date: 2023-03-14
tags: [ner, abbreviation, acronym, en, clinical, licensed]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_abbreviation_clinical](https://nlp.johnsnowlabs.com/2021/12/30/ner_abbreviation_clinical_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_abbreviation_clinical_pipeline_en_4.3.0_3.2_1678777406281.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_abbreviation_clinical_pipeline_en_4.3.0_3.2_1678777406281.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_abbreviation_clinical_pipeline", "en", "clinical/models")
text = '''Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_abbreviation_clinical_pipeline", "en", "clinical/models")
val text = "Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.clinical-abbreviation.pipeline").predict("""Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""")
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:-------------|--------:|------:|:------------|-------------:|
| 0 | CBC | 126 | 128 | ABBR | 1 |
| 1 | AB | 159 | 160 | ABBR | 1 |
| 2 | VDRL | 189 | 192 | ABBR | 1 |
| 3 | HIV | 247 | 249 | ABBR | 1 |
```
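`fullAnnotate` returns chunk annotations that can be flattened into rows like the table above. The sketch below uses plain dicts as stand-ins for Spark NLP Annotation objects (the exact attribute access on real Annotation objects may differ):

```python
def chunks_to_rows(chunks):
    # Flatten chunk annotations (plain dicts standing in for Spark NLP
    # Annotation objects) into (text, begin, end, label, confidence) rows.
    return [
        (ch["result"], ch["begin"], ch["end"],
         ch["metadata"]["entity"], float(ch["metadata"]["confidence"]))
        for ch in chunks
    ]

# Toy annotations mirroring two rows of the results table above.
sample = [
    {"result": "CBC", "begin": 126, "end": 128,
     "metadata": {"entity": "ABBR", "confidence": "1.0"}},
    {"result": "VDRL", "begin": 189, "end": 192,
     "metadata": {"entity": "ABBR", "confidence": "1.0"}},
]
for row in chunks_to_rows(sample):
    print(row)
```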
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_abbreviation_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from google)
author: John Snow Labs
name: t5_efficient_base_nl4
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-nl4` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nl4_en_4.3.0_3.0_1675113874908.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nl4_en_4.3.0_3.0_1675113874908.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_base_nl4","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_base_nl4","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_base_nl4|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|221.4 MB|
## References
- https://huggingface.co/google/t5-efficient-base-nl4
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Pipeline to Detect Clinical Entities (WIP)
author: John Snow Labs
name: jsl_ner_wip_clinical_pipeline
date: 2022-03-21
tags: [licensed, ner, wip, clinical, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [jsl_ner_wip_clinical](https://nlp.johnsnowlabs.com/2021/03/31/jsl_ner_wip_clinical_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_clinical_pipeline_en_3.4.1_3.0_1647865732108.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_clinical_pipeline_en_3.4.1_3.0_1647865732108.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("jsl_ner_wip_clinical_pipeline", "en", "clinical/models")
pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.")
```
```scala
val pipeline = new PretrainedPipeline("jsl_ner_wip_clinical_pipeline", "en", "clinical/models")
pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.jsl_wip_clinical.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
## Results
```bash
+-----------------------------------------+----------------------------+
|chunk |ner_label |
+-----------------------------------------+----------------------------+
|21-day-old |Age |
|Caucasian |Race_Ethnicity |
|male |Gender |
|for 2 days |Duration |
|congestion |Symptom |
|mom |Gender |
|yellow |Modifier |
|discharge |Symptom |
|nares |External_body_part_or_region|
|she |Gender |
|mild |Modifier |
|problems with his breathing while feeding|Symptom |
|perioral cyanosis |Symptom |
|retractions |Symptom |
|One day ago |RelativeDate |
|mom |Gender |
|Tylenol |Drug_BrandName |
|Baby |Age |
|decreased p.o. intake |Symptom |
|His |Gender |
+-----------------------------------------+----------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|jsl_ner_wip_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: Part of Speech for Bengali (pos_msri)
author: John Snow Labs
name: pos_msri
date: 2021-01-20
task: Part of Speech Tagging
language: bn
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [bn, pos, open_source]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model annotates the part of speech of tokens in a text. The parts of speech annotated include NN (noun), CC (Conjuncts - coordinating and subordinating), and 26 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
## Predicted Entities
`BM` (Not Documented), `CC (Conjuncts, Coordinating and Subordinating)`, `CL (Clitics)`, `DEM (Demonstratives)`, `INJ (Interjection)`, `INTF (Intensifier)`, `JJ (Adjective)`, `NEG (Negative)`, `NN (Noun)`, `NNC (Compound Nouns)`, `NNP (Proper Noun)`, `NST (Preposition of Direction)`, `PPR (Postposition)`, `PRP (Pronoun)`, `PSP (Preposition)`, `QC (Cardinal Number)`, `QF (Quantifiers)`, `QO (Ordinal Numbers)`, `RB (Adverb)`, `RDP (Not Documented)`, `RP (Particle)`, `SYM (Special Symbol)`, `UT (Not Documented)`, `VAUX (Verb Auxiliary)`, `VM (Verb)`, `WQ (wh- qualifier)`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_msri_bn_2.7.0_2.4_1611173659719.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_msri_bn_2.7.0_2.4_1611173659719.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_msri", "bn") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([["বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে মোদ ' ৷"]], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_msri", "bn")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷"]
pos_df = nlu.load('bn.pos').predict(text, output_level = "token")
pos_df
```
## Results
```bash
+------------------------------------------------------+----------------------------------------+
|text |result |
+------------------------------------------------------+----------------------------------------+
|বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷|[NN, NNP, NN, NN, VM, SYM, NN, SYM, SYM]|
+------------------------------------------------------+----------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_msri|
|Compatibility:|Spark NLP 2.7.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[pos]|
|Language:|bn|
## Data Source
The model was trained on the _Indian Language POS-Tagged Corpus_ from [NLTK](http://www.nltk.org) collected by A Kumaran (Microsoft Research, India).
## Benchmarking
```bash
| | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| BM | 1.00 | 1.00 | 1.00 | 1 |
| CC | 0.99 | 0.99 | 0.99 | 390 |
| CL | 1.00 | 1.00 | 1.00 | 2 |
| DEM | 0.98 | 0.99 | 0.98 | 139 |
| INJ | 0.92 | 0.85 | 0.88 | 13 |
| INTF | 1.00 | 1.00 | 1.00 | 55 |
| JJ | 0.99 | 0.99 | 0.99 | 688 |
| NEG | 0.99 | 0.98 | 0.99 | 135 |
| NN | 0.99 | 0.99 | 0.99 | 2996 |
| NNC | 1.00 | 1.00 | 1.00 | 4 |
| NNP | 0.97 | 0.98 | 0.97 | 528 |
| NST | 1.00 | 1.00 | 1.00 | 156 |
| PPR | 1.00 | 1.00 | 1.00 | 1 |
| PRP | 0.98 | 0.98 | 0.98 | 685 |
| PSP | 0.99 | 0.99 | 0.99 | 250 |
| QC | 0.99 | 0.99 | 0.99 | 193 |
| QF | 0.98 | 0.98 | 0.98 | 187 |
| QO | 1.00 | 1.00 | 1.00 | 22 |
| RB | 0.99 | 0.99 | 0.99 | 187 |
| RDP | 1.00 | 0.98 | 0.99 | 44 |
| RP | 0.99 | 0.96 | 0.97 | 79 |
| SYM | 0.97 | 0.98 | 0.98 | 1413 |
| UNK | 1.00 | 1.00 | 1.00 | 1 |
| UT | 1.00 | 1.00 | 1.00 | 18 |
| VAUX | 0.97 | 0.97 | 0.97 | 400 |
| VM | 0.99 | 0.98 | 0.98 | 1393 |
| WQ | 1.00 | 0.99 | 0.99 | 71 |
| XC | 0.98 | 0.97 | 0.97 | 219 |
| accuracy | | | 0.98 | 10270 |
| macro avg | 0.99 | 0.98 | 0.99 | 10270 |
| weighted avg | 0.98 | 0.98 | 0.98 | 10270 |
```
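The "weighted avg" row in the benchmark is the support-weighted mean of the per-class scores. As a minimal sketch, the helper below recomputes such an average from a subset of the table's F1 scores (only three classes are used here, so the result illustrates the formula rather than reproducing the full-table value):

```python
def weighted_average(scores, supports):
    # Support-weighted mean, as used for the "weighted avg" row above.
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

# Subset of per-class F1 scores and supports (NN, SYM, VM from the table).
f1 = [0.99, 0.98, 0.98]
support = [2996, 1413, 1393]
print(round(weighted_average(f1, support), 3))
```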
---
layout: model
title: Sentence Entity Resolver for Billable ICD10-CM HCC Codes (sbertresolve_icd10cm_slim_billable_hcc_med)
author: John Snow Labs
name: sbertresolve_icd10cm_slim_billable_hcc_med
date: 2021-05-25
tags: [icd10cm, licensed, slim, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.3
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities and concepts to ICD10-CM codes using sentence BERT embeddings. In this model, synonyms having low cosine similarity to unnormalized terms are dropped. It also returns the official resolution text within the brackets inside the metadata. The model is augmented with synonyms, and previous augmentations are flexed according to cosine distances to unnormalized terms (ground truths).
## Predicted Entities
Outputs 7-digit billable ICD codes. In the results, look for the `aux_label` parameter in the metadata to get the HCC status. The HCC status can be split to obtain further information: billable status, HCC status, and HCC score. For example, in the output shared below, the billable status is 1, the HCC status is 1, and the HCC score is 11.
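The three values can be recovered by splitting the status string. Below is a minimal sketch in plain Python; the `||` separator and the `parse_hcc_status` helper are assumptions for illustration, based on the `['1', '1', '11']` list shown in the Results section — adjust to match the actual metadata format you observe:

```python
def parse_hcc_status(aux_label: str) -> dict:
    """Split an aux_label such as '1||1||11' into its three components.

    NOTE: the '||' separator is an assumption; check the metadata of your
    own pipeline output and adjust accordingly.
    """
    billable, hcc_status, hcc_score = aux_label.split("||")
    return {
        "billable": billable,       # '1' -> the code is billable
        "hcc_status": hcc_status,   # '1' -> the code has an HCC mapping
        "hcc_score": hcc_score,     # HCC community score
    }

print(parse_hcc_status("1||1||11"))
```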
{:.btn-box}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_slim_billable_hcc_med_en_3.0.3_2.4_1621977523869.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_slim_billable_hcc_med_en_3.0.3_2.4_1621977523869.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
sbert_embedder = BertSentenceEmbeddings\
.pretrained('sbert_jsl_medium_uncased', 'en','clinical/models')\
.setInputCols(["ner_chunk"])\
.setOutputCol("sbert_embeddings")
icd10_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_icd10cm_slim_billable_hcc_med","en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("icd10cm_code")\
.setDistanceFunction("EUCLIDEAN").setReturnCosineDistances(True)
bert_pipeline_icd = Pipeline(stages = [documentAssembler, sbert_embedder, icd10_resolver])
data = spark.createDataFrame([["bladder cancer"]]).toDF("text")
results = bert_pipeline_icd.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbert_jsl_medium_uncased","en","clinical/models")
.setInputCols("document")
.setOutputCol("sbert_embeddings")
val icd10_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_icd10cm_slim_billable_hcc_med","en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("icd10cm_code")
.setDistanceFunction("EUCLIDEAN")
.setReturnCosineDistances(true)
val bert_pipeline_icd = new Pipeline().setStages(Array(document_assembler, sbert_embedder, icd10_resolver))
val data = Seq("bladder cancer").toDS.toDF("text")
val result = bert_pipeline_icd.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.icd10cm.slim_billable_hcc_med").predict("""bladder cancer""")
```
## Results
```bash
| | chunks | code | resolutions | all_codes | billable_hcc_status_score | all_distances |
|---:|:---------------|:--------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|--------------------------------------------------------------------------------------:|:----------------------------|:-----------------------------------------------------------------------------------------------------------------|
| 0 | bladder cancer | C671 |[bladder cancer, dome [Malignant neoplasm of dome of bladder], cancer of the urinary bladder [Malignant neoplasm of bladder, unspecified], prostate cancer [Malignant neoplasm of prostate], cancer of the urinary bladder, lateral wall [Malignant neoplasm of lateral wall of bladder], cancer of the urinary bladder, anterior wall [Malignant neoplasm of anterior wall of bladder], cancer of the urinary bladder, posterior wall [Malignant neoplasm of posterior wall of bladder], cancer of the urinary bladder, neck [Malignant neoplasm of bladder neck], cancer of the urinary bladder, ureteric orifice [Malignant neoplasm of ureteric orifice]]| [C671, C679, C61, C672, C673, C674, C675, C676, D090, Z126, D494, C670, Z8551, C7911] | ['1', '1', '11'] | [0.0894, 0.1051, 0.1184, 0.1180, 0.1200, 0.1204, 0.1255, 0.1375, 0.1357, 0.1452, 0.1469, 0.1513, 0.1500, 0.1575] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbertresolve_icd10cm_slim_billable_hcc_med|
|Compatibility:|Healthcare NLP 3.0.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk, sbert_embeddings]|
|Output Labels:|[icd10_code]|
|Language:|en|
|Case sensitive:|false|
---
layout: model
title: Swedish asr_lm_swedish TFWav2Vec2ForCTC from birgermoell
author: John Snow Labs
name: asr_lm_swedish
date: 2022-09-25
tags: [wav2vec2, sv, audio, open_source, asr]
task: Automatic Speech Recognition
language: sv
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_lm_swedish` is a Swedish model originally trained by birgermoell.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_lm_swedish_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_lm_swedish_sv_4.2.0_3.0_1664117876808.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_lm_swedish_sv_4.2.0_3.0_1664117876808.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_lm_swedish", "sv")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_lm_swedish", "sv")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_lm_swedish|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|sv|
|Size:|757.4 MB|
---
layout: model
title: Translate English to Gun Pipeline
author: John Snow Labs
name: translate_en_guw
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, guw, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `guw`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_guw_xx_2.7.0_2.4_1609688347625.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_guw_xx_2.7.0_2.4_1609688347625.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_guw", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_guw", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.guw').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_guw|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from evegarcianz)
author: John Snow Labs
name: distilbert_qa_evegarcianz_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `evegarcianz`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_evegarcianz_finetuned_squad_en_4.3.0_3.0_1672765810295.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_evegarcianz_finetuned_squad_en_4.3.0_3.0_1672765810295.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_evegarcianz_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_evegarcianz_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_evegarcianz_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/evegarcianz/bert-finetuned-squad
---
layout: model
title: Legal NER on EDGAR Documents
author: John Snow Labs
name: legner_sec_edgar
date: 2023-04-13
tags: [en, licensed, legal, ner, sec, edgar]
task: Named Entity Recognition
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Legal NER model extracts `ORG`, `INST`, `LAW`, `COURT`, `PER`, `LOC`, `MISC`, `ALIAS`, and `TICKER` entities from the US SEC EDGAR documents.
## Predicted Entities
`ALIAS`, `COURT`, `INST`, `LAW`, `LOC`, `MISC`, `ORG`, `PER`, `TICKER`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_sec_edgar_en_1.0.0_3.0_1681397579002.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_sec_edgar_en_1.0.0_3.0_1681397579002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)\
.setCaseSensitive(True)
ner_model = legal.NerModel.pretrained("legner_sec_edgar", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""In our opinion, the accompanying consolidated balance sheets and the related consolidated statements of operations, of changes in stockholders' equity, and of cash flows present fairly, in all material respects, the financial position of SunGard Capital Corp. II and its subsidiaries ( SCC II ) at December 31, 2010, and 2009, and the results of their operations and their cash flows for each of the three years in the period ended December 31, 2010, in conformity with accounting principles generally accepted in the United States of America."""]
result = model.transform(spark.createDataFrame([text]).toDF("text"))
```
## Results
```bash
+----------------------------------------+---------+
|chunk |ner_label|
+----------------------------------------+---------+
|SunGard Capital Corp. II |ORG |
|SCC II |ALIAS |
|accounting principles generally accepted|LAW |
|United States of America |LOC |
+----------------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_sec_edgar|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|16.3 MB|
## References
In-house annotations
## Benchmarking
```bash
label precision recall f1-score support
ALIAS 0.86 0.74 0.79 84
COURT 0.86 1.00 0.92 6
INST 0.94 0.76 0.84 76
LAW 0.91 0.93 0.92 166
LOC 0.89 0.88 0.88 140
MISC 0.90 0.83 0.86 226
ORG 0.89 0.93 0.91 430
PER 0.92 0.92 0.92 66
TICKER 1.00 0.86 0.92 7
micro-avg 0.90 0.88 0.89 1201
macro-avg 0.91 0.87 0.89 1201
weighted-avg 0.90 0.88 0.89 1201
```
---
layout: model
title: English DistilBertForQuestionAnswering model (from andi611) Squad2 with Neg, Repeat
author: John Snow Labs
name: distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_repeat
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-squad2-with-ner-with-neg-with-repeat` is an English model originally trained by `andi611`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_repeat_en_4.0.0_3.0_1654727466812.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_repeat_en_4.0.0_3.0_1654727466812.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_repeat","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_repeat","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_conll.distil_bert.base_uncased_with_neg_with_repeat.by_andi611").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_repeat|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/andi611/distilbert-base-uncased-squad2-with-ner-with-neg-with-repeat
---
layout: model
title: Legal Independent contractor Clause Binary Classifier
author: John Snow Labs
name: legclf_independent_contractor_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `independent-contractor` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.
If you have big legal documents, and you want to look for clauses, we recommend you to split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
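As a minimal sketch of the paragraph-splitting approach mentioned above (plain Python, independent of Spark NLP; whitespace tokens are a rough proxy for the real tokenizer's 512-token budget, and the `split_paragraphs` helper is illustrative, not part of the library):

```python
def split_paragraphs(text: str, max_tokens: int = 512) -> list:
    """Split a document into paragraphs (by blank lines) and keep only
    chunks that fit the embedding model's token budget.

    NOTE: whitespace tokenization only approximates the model's tokenizer;
    leave headroom if your text contains long subword-heavy terms.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [p for p in paragraphs if len(p.split()) <= max_tokens]

doc = (
    "Clause 1. Contractor shall act as an independent contractor.\n\n"
    "Clause 2. This Agreement is governed by the laws of Delaware."
)
print(split_paragraphs(doc))
```

Each surviving chunk can then be fed to the classifier as a separate row of the input DataFrame.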
## Predicted Entities
`other`, `independent-contractor`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_independent_contractor_clause_en_1.0.0_3.2_1660122527352.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_independent_contractor_clause_en_1.0.0_3.2_1660122527352.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
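This card is missing its usage snippet. Below is a minimal sketch following the pattern of other `legclf_*` cards, assuming a licensed `johnsnowlabs` session and the usual document → sentence-embeddings → classifier pipeline; the `sent_bert_base_cased` embeddings name is an assumption — use the embeddings this classifier was trained with if they differ:

```python
# Minimal sketch; assumes a licensed johnsnowlabs session is already started.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# The embeddings model name below is an assumption for illustration.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_independent_contractor_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```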
## Results
```bash
+------------------------+
|result                  |
+------------------------+
|[independent-contractor]|
|[other]                 |
|[other]                 |
|[independent-contractor]|
+------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_independent_contractor_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
independent-contractor 1.00 1.00 1.00 34
other 1.00 1.00 1.00 101
accuracy - - 1.00 135
macro-avg 1.00 1.00 1.00 135
weighted-avg 1.00 1.00 1.00 135
```
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from guhuawuli)
author: John Snow Labs
name: distilbert_qa_guhuawuli_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `guhuawuli`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_guhuawuli_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770918999.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_guhuawuli_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770918999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_guhuawuli_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_guhuawuli_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_guhuawuli_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/guhuawuli/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English ALBERT Embeddings (xx-large)
author: John Snow Labs
name: albert_embeddings_albert_xxlarge_v1
date: 2022-04-14
tags: [albert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: AlBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-xxlarge-v1` is an English model originally trained by HuggingFace.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_xxlarge_v1_en_3.4.2_3.0_1649954172408.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_xxlarge_v1_en_3.4.2_3.0_1649954172408.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_xxlarge_v1","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_xxlarge_v1","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.albert_xxlarge_v1").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_embeddings_albert_xxlarge_v1|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|834.9 MB|
|Case sensitive:|false|
## References
- https://huggingface.co/albert-xxlarge-v1
- https://arxiv.org/abs/1909.11942
- https://github.com/google-research/albert
- https://yknzhu.wixsite.com/mbweb
- https://en.wikipedia.org/wiki/English_Wikipedia
---
layout: model
title: Relation Extraction Between Body Parts and Direction Entities (ReDL)
author: John Snow Labs
name: redl_bodypart_direction_biobert
date: 2023-01-14
tags: [licensed, en, clinical, relation_extraction, tensorflow]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Relation extraction between body part entities (such as Internal_organ_or_component and External_body_part_or_region) and direction entities (such as upper, lower) in clinical texts. `1`: the body part and direction entity are related; `0`: they are not related.
## Predicted Entities
`1`, `0`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_direction_biobert_en_4.2.4_3.0_1673710170047.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_direction_biobert_en_4.2.4_3.0_1673710170047.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
words_embedder = WordEmbeddingsModel() \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"]) \
.setOutputCol("embeddings")
ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_converter = NerConverterInternal() \
.setInputCols(["sentences", "tokens", "ner_tags"]) \
.setOutputCol("ner_chunks")
dependency_parser = DependencyParserModel() \
.pretrained("dependency_conllu", "en") \
.setInputCols(["sentences", "pos_tags", "tokens"]) \
.setOutputCol("dependencies")
# Set a filter on pairs of named entities which will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter() \
.setInputCols(["ner_chunks", "dependencies"])\
.setMaxSyntacticDistance(10)\
.setOutputCol("re_ner_chunks")\
.setRelationPairs(['direction-external_body_part_or_region',
'external_body_part_or_region-direction',
'direction-internal_organ_or_component',
'internal_organ_or_component-direction'
])
# The dataset this model was trained on is sentence-wise.
# This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input.
re_model = RelationExtractionDLModel()\
.pretrained('redl_bodypart_direction_biobert', 'en', "clinical/models") \
.setPredictionThreshold(0.5)\
.setInputCols(["re_ner_chunks", "sentences"]) \
.setOutputCol("relations")
pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model])
data = spark.createDataFrame([[''' MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia ''']]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val pos_tagger = PerceptronModel()
.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val words_embedder = WordEmbeddingsModel()
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel()
.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
// Set a filter on pairs of named entities which will be treated as relation candidates
val re_ner_chunk_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunks", "dependencies"))
.setMaxSyntacticDistance(10)
.setOutputCol("re_ner_chunks")
.setRelationPairs(Array("direction-external_body_part_or_region",
"external_body_part_or_region-direction",
"direction-internal_organ_or_component",
"internal_organ_or_component-direction"))
// This model was trained on sentence-level relations.
// It can also be trained on document-level relations; in that case, pass "document" instead of "sentence" as input when predicting.
val re_model = RelationExtractionDLModel()
.pretrained("redl_bodypart_direction_biobert", "en", "clinical/models")
.setPredictionThreshold(0.5)
.setInputCols(Array("re_ner_chunks", "sentences"))
.setOutputCol("relations")
val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model))
val data = Seq("MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation").predict(""" MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia """)
```
## Results
```bash
| index | relations | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|-------|-----------|-----------------------------|---------------|-------------|------------|-----------------------------|-------------|-------------|---------------|------------|
| 0 | 1 | Direction | 35 | 39 | upper | Internal_organ_or_component | 41 | 50 | brain stem | 0.9999989 |
| 1 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 59 | 68 | cerebellum | 0.99992585 |
| 2 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.9999999 |
| 3 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 54 | 57 | left | 0.999811 |
| 4 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 75 | 79 | right | 0.9998203 |
| 5 | 1 | Direction | 54 | 57 | left | Internal_organ_or_component | 59 | 68 | cerebellum | 1.0 |
| 6 | 0 | Direction | 54 | 57 | left | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.97616416 |
| 7 | 0 | Internal_organ_or_component | 59 | 68 | cerebellum | Direction | 75 | 79 | right | 0.953046 |
| 8 | 1 | Direction | 75 | 79 | right | Internal_organ_or_component | 81 | 93 | basil ganglia | 1.0 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|redl_bodypart_direction_biobert|
|Compatibility:|Healthcare NLP 4.2.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|401.7 MB|
## References
Trained on an internal dataset.
## Benchmarking
```bash
label Recall Precision F1 Support
0 0.856 0.873 0.865 153
1 0.986 0.984 0.985 1347
Avg. 0.921 0.929 0.925 -
```
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from allenai)
author: John Snow Labs
name: t5_unifiedqa_v2_base_1363200
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unifiedqa-v2-t5-base-1363200` is an English model originally trained by `allenai`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_unifiedqa_v2_base_1363200_en_4.3.0_3.0_1675157943693.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_unifiedqa_v2_base_1363200_en_4.3.0_3.0_1675157943693.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_unifiedqa_v2_base_1363200","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_unifiedqa_v2_base_1363200","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_unifiedqa_v2_base_1363200|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|474.3 MB|
## References
- https://huggingface.co/allenai/unifiedqa-v2-t5-base-1363200
- https://github.com/allenai/unifiedqa
---
layout: model
title: Stop Words Cleaner for Finnish
author: John Snow Labs
name: stopwords_fi
date: 2020-07-14 19:03:00 +0800
task: Stop Words Removal
language: fi
edition: Spark NLP 2.5.4
spark_version: 2.4
tags: [stopwords, fi]
supported: true
annotator: StopWordsCleaner
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_fi_fi_2.5.4_2.4_1594742441054.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_fi_fi_2.5.4_2.4_1594742441054.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
stop_words = StopWordsCleaner.pretrained("stopwords_fi", "fi") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä.")
```
```scala
...
val stopWords = StopWordsCleaner.pretrained("stopwords_fi", "fi")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords))
val data = Seq("Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä."""]
stopword_df = nlu.load('fi.stopwords').predict(text)
stopword_df[["cleanTokens"]]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=11, end=11, result=',', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=25, end=33, result='pohjoisen', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=35, end=42, result='kuningas', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=43, end=43, result=',', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=45, end=48, result='John', metadata={'sentence': '0'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_fi|
|Type:|stopwords|
|Compatibility:|Spark NLP 2.5.4+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|fi|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)
---
layout: model
title: Stop Words Cleaner for Thai
author: John Snow Labs
name: stopwords_th
date: 2020-07-14 19:03:00 +0800
task: Stop Words Removal
language: th
edition: Spark NLP 2.5.4
spark_version: 2.4
tags: [stopwords, th]
supported: true
annotator: StopWordsCleaner
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_th_th_2.5.4_2.4_1594742440606.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_th_th_2.5.4_2.4_1594742440606.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
stop_words = StopWordsCleaner.pretrained("stopwords_th", "th") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("นอกเหนือจากการเป็นราชาแห่งทิศเหนือแล้วจอห์นสโนว์ยังเป็นแพทย์ชาวอังกฤษและเป็นผู้นำในการพัฒนายาระงับความรู้สึกและสุขอนามัยทางการแพทย์")
```
```scala
...
val stopWords = StopWordsCleaner.pretrained("stopwords_th", "th")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords))
val data = Seq("นอกเหนือจากการเป็นราชาแห่งทิศเหนือแล้วจอห์นสโนว์ยังเป็นแพทย์ชาวอังกฤษและเป็นผู้นำในการพัฒนายาระงับความรู้สึกและสุขอนามัยทางการแพทย์").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""นอกเหนือจากการเป็นราชาแห่งทิศเหนือแล้วจอห์นสโนว์ยังเป็นแพทย์ชาวอังกฤษและเป็นผู้นำในการพัฒนายาระงับความรู้สึกและสุขอนามัยทางการแพทย์"""]
stopword_df = nlu.load('th.stopwords').predict(text)
stopword_df[['cleanTokens']]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=130, result='นอกเหนือจากการเป็นราชาแห่งทิศเหนือแล้วจอห์นสโนว์ยังเป็นแพทย์ชาวอังกฤษและเป็นผู้นำในการพัฒนายาระงับความรู้สึกและสุขอนามัยทางการแพทย์', metadata={'sentence': '0'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_th|
|Type:|stopwords|
|Compatibility:|Spark NLP 2.5.4+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|th|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)
---
layout: model
title: Legal European Construction Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_european_construction_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, european_construction, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the `legclf_european_construction_bert` model, a BERT Sentence Embeddings Document Classifier, determines whether the document belongs to the `European_Construction` class or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`European_Construction`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_european_construction_bert_en_1.0.0_3.0_1678111732690.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_european_construction_bert_en_1.0.0_3.0_1678111732690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
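This card is missing its usage snippet. The following is a minimal sketch modeled on the other classifier cards in this collection; the embeddings model name `sent_bert_base_cased` and the `LegalClassifierDLModel` class (taken from this card's annotator field) are assumptions, not confirmed usage:

```python
# Assemble raw text, embed it with BERT sentence embeddings, then classify.
# NOTE: model/class names below are inferred from this card's metadata.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")
docClassifier = LegalClassifierDLModel.pretrained("legclf_european_construction_bert", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```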
## Results
```bash
+-------+
|result|
+-------+
|[European_Construction]|
|[Other]|
|[Other]|
|[European_Construction]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_european_construction_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.8 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
European_Construction 0.85 0.90 0.87 535
Other 0.88 0.83 0.86 505
accuracy - - 0.87 1040
macro-avg 0.87 0.87 0.87 1040
weighted-avg 0.87 0.87 0.87 1040
```
---
layout: model
title: French CamemBert Embeddings (from adeiMousa)
author: John Snow Labs
name: camembert_embeddings_adeiMousa_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `adeiMousa`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_adeiMousa_generic_model_fr_3.4.4_3.0_1653987280320.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_adeiMousa_generic_model_fr_3.4.4_3.0_1653987280320.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_adeiMousa_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_adeiMousa_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_adeiMousa_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/adeiMousa/dummy-model
---
layout: model
title: English Named Entity Recognition (from DeDeckerThomas)
author: John Snow Labs
name: distilbert_ner_keyphrase_extraction_distilbert_openkp
date: 2022-05-16
tags: [distilbert, ner, token_classification, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-openkp` is an English model originally trained by `DeDeckerThomas`.
## Predicted Entities
`KEY`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_keyphrase_extraction_distilbert_openkp_en_3.4.2_3.0_1652721945024.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_keyphrase_extraction_distilbert_openkp_en_3.4.2_3.0_1652721945024.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_keyphrase_extraction_distilbert_openkp","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_keyphrase_extraction_distilbert_openkp","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_ner_keyphrase_extraction_distilbert_openkp|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.7 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/DeDeckerThomas/keyphrase-extraction-distilbert-openkp
- https://github.com/microsoft/OpenKP
- https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=openkp
---
layout: model
title: Detect diseases in text (large)
author: John Snow Labs
name: ner_diseases_large
date: 2021-04-01
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Extract mentions of different types of diseases in medical text using a pretrained NER model.
## Predicted Entities
`Disease`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DIAG_PROC/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diseases_large_en_3.0.0_3.0_1617260844811.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diseases_large_en_3.0.0_3.0_1617260844811.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_diseases_large", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_diseases_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))
val data = Seq("EXAMPLE_TEXT").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.diseases.large").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_diseases_large|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
---
layout: model
title: Clinical Deidentification Pipeline (English, slim)
author: John Snow Labs
name: clinical_deidentification_slim
date: 2023-06-13
tags: [deidentification, deid, glove, slim, pipeline, clinical, en, licensed]
task: Pipeline Healthcare
language: en
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline is trained with lightweight `glove_100d` embeddings and can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR`, `EMAIL` entities.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_slim_en_4.4.4_3.2_1686665745769.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_slim_en_4.4.4_3.2_1686665745769.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("clinical_deidentification_slim", "en", "clinical/models")
sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13.
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93.
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""
result = deid_pipeline.annotate(sample)
print("\n".join(result['masked']))
print("\n".join(result['masked_with_chars']))
print("\n".join(result['masked_fixed_length_chars']))
print("\n".join(result['obfuscated']))
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val deid_pipeline = new PretrainedPipeline("clinical_deidentification_slim","en","clinical/models")
val sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13.
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93.
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""
val result = deid_pipeline.annotate(sample)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.de_identify.clinical_slim").predict("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13.
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93.
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""")
```
## Results
```bash
Results
Masked with entity labels
------------------------------
Name : <PATIENT>, Record date: <DATE>, # <MEDICALRECORD>.
Dr. <DOCTOR>, ID: <IDNUM>, IP <IPADDR>.
He is a <AGE> male was admitted to the <HOSPITAL> for cystectomy on <DATE>.
Patient's VIN : <VIN>, SSN <SSN>, Driver's license <DLN>.
Phone <PHONE>, <STREET>, <CITY>, E-MAIL: <EMAIL>.
Masked with chars
------------------------------
Name : [**************], Record date: [********], # [****].
Dr. [********], ID: [********], IP [************].
He is a [*********] male was admitted to the [**********] for cystectomy on [******].
Patient's VIN : [***************], SSN [**********], Driver's license [*********].
Phone [************], [***************], [***********], E-MAIL: [*************].
Masked with fixed length chars
------------------------------
Name : ****, Record date: ****, # ****.
Dr. ****, ID: ****, IP ****.
He is a **** male was admitted to the **** for cystectomy on ****.
Patient's VIN : ****, SSN ****, Driver's license ****.
Phone ****, ****, ****, E-MAIL: ****.
Obfuscated
------------------------------
Name : Layne Nation, Record date: 2093-03-13, # C6240488.
Dr. Dr Rosalba Hill, ID: JY:3489547, IP 005.005.005.005.
He is a 79 male was admitted to the JOHN MUIR MEDICAL CENTER-CONCORD CAMPUS for cystectomy on 01-25-1997.
Patient's VIN : 3CCCC22DDDD333888, SSN SSN-289-37-4495, Driver's license S99983662.
Phone 04.32.52.27.90, North Adrienne, Colorado Springs, E-MAIL: Rawland@google.com.
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|clinical_deidentification_slim|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|181.9 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- ChunkMergeModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- TextMatcherModel
- ContextualParserModel
- RegexMatcherModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ChunkMergeModel
- ChunkMergeModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- Finisher
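The masked outputs above follow simple substitution policies. A plain-Python sketch of the three masking modes (illustrative only, not the Spark NLP implementation; note that same-length masking keeps the total width of the original chunk, brackets included):

```python
def mask_with_label(text, chunk, label):
    """Replace the chunk with its entity label, e.g. <DOCTOR>."""
    return text.replace(chunk, "<%s>" % label)

def mask_with_chars(text, chunk):
    """Replace the chunk with a bracketed run of '*' of the same total length."""
    return text.replace(chunk, "[" + "*" * (len(chunk) - 2) + "]")

def mask_fixed_length(text, chunk, n=4):
    """Replace the chunk with a fixed number of '*' characters."""
    return text.replace(chunk, "*" * n)

text = "Dr. John Green, ID: 1231511863."
print(mask_with_label(text, "John Green", "DOCTOR"))  # Dr. <DOCTOR>, ID: 1231511863.
print(mask_with_chars(text, "John Green"))            # Dr. [********], ID: 1231511863.
print(mask_fixed_length(text, "John Green"))          # Dr. ****, ID: 1231511863.
```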
---
layout: model
title: English BertForQuestionAnswering model (from twmkn9)
author: John Snow Labs
name: bert_qa_twmkn9_bert_base_uncased_squad2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad2` is an English model originally trained by `twmkn9`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_twmkn9_bert_base_uncased_squad2_en_4.0.0_3.0_1654181501175.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_twmkn9_bert_base_uncased_squad2_en_4.0.0_3.0_1654181501175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_twmkn9_bert_base_uncased_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_twmkn9_bert_base_uncased_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.bert.base_uncased.by_twmkn9").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_twmkn9_bert_base_uncased_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/twmkn9/bert-base-uncased-squad2
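Extractive QA models such as this one score every token as a potential answer start and end; the predicted answer is the highest-scoring valid span. A minimal sketch with hypothetical scores:

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) maximizing start + end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            if s_score + end_scores[e] > best_score:
                best_score, best = s_score + end_scores[e], (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end = [0.0, 0.1, 0.0, 4.5, 0.2, 0.0, 0.0, 0.1, 0.4, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```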
---
layout: model
title: English Bert Embeddings (Large, Uncased)
author: John Snow Labs
name: bert_embeddings_bert_large_uncased_whole_word_masking
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-uncased-whole-word-masking` is an English model originally trained by HuggingFace.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_uncased_whole_word_masking_en_3.4.2_3.0_1649671495082.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_uncased_whole_word_masking_en_3.4.2_3.0_1649671495082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_uncased_whole_word_masking","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_uncased_whole_word_masking","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.bert_large_uncased_whole_word_masking").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_large_uncased_whole_word_masking|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
## References
- https://huggingface.co/bert-large-uncased-whole-word-masking
- https://arxiv.org/abs/1810.04805
- https://github.com/google-research/bert
- https://yknzhu.wixsite.com/mbweb
- https://en.wikipedia.org/wiki/English_Wikipedia
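Downstream, such embeddings are typically compared with cosine similarity. A small sketch with toy vectors standing in for the model's 1024-dimensional outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# toy 4-dim vectors standing in for 1024-dim BERT-large embeddings
v_research = [0.9, 0.8, 0.1, 0.0]
v_science = [0.85, 0.75, 0.2, 0.05]
v_banana = [0.1, 0.0, 0.9, 0.8]
print(cosine(v_research, v_science) > cosine(v_research, v_banana))  # True
```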
---
layout: model
title: Sentence Entity Resolver for CPT codes (Augmented)
author: John Snow Labs
name: sbiobertresolve_cpt_procedures_augmented
date: 2021-05-30
tags: [licensed, entity_resolution, en, clinical]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.4
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to CPT codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. This model is enriched with augmented data for better performance.
## Predicted Entities
CPT codes and their descriptions.
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_procedures_augmented_en_3.0.4_3.0_1622371775342.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_procedures_augmented_en_3.0.4_3.0_1622371775342.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_cpt_procedures_augmented","en", "clinical/models") \
.setInputCols(["ner_chunk", "sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver])
data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
...
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
val resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_cpt_procedures_augmented","en", "clinical/models")
.setInputCols(Array("ner_chunk", "sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver))
val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret's Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.cpt.procedures_augmented").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""")
```
## Results
```bash
+--------------------+-----+---+-------+-----+----------+--------------------+--------------------+
| chunk|begin|end| entity| code|confidence| all_k_resolutions| all_k_codes|
+--------------------+-----+---+-------+-----+----------+--------------------+--------------------+
| hypertension| 68| 79|PROBLEM|36440| 0.3349|Hypertransfusion:...|36440:::24935:::0...|
|chronic renal ins...| 83|109|PROBLEM|50395| 0.0821|Nephrostomy:::Ren...|50395:::50328:::5...|
| COPD| 113|116|PROBLEM|32960| 0.1575|Lung collapse pro...|32960:::32215:::1...|
| gastritis| 120|128|PROBLEM|43501| 0.1772|Gastric ulcer sut...|43501:::43631:::4...|
| TIA| 136|138|PROBLEM|61460| 0.1432|Intracranial tran...|61460:::64742:::2...|
|a non-ST elevatio...| 182|202|PROBLEM|61624| 0.1151|Percutaneous non-...|61624:::61626:::3...|
|Guaiac positive s...| 208|229|PROBLEM|44005| 0.1115|Enterolysis:::Abd...|44005:::49080:::4...|
| mid LAD lesion| 332|345|PROBLEM|0281T| 0.2407|Plication of left...|0281T:::93462:::9...|
| hypotension| 362|372|PROBLEM|99135| 0.9935|Induced hypotensi...|99135:::99185:::9...|
| bradycardia| 378|388|PROBLEM|99135| 0.3884|Induced hypotensi...|99135:::33305:::3...|
| vagal reaction| 466|479|PROBLEM|55450| 0.1427|Vasoligation:::Va...|55450:::64408:::7...|
+--------------------+-----+---+-------+-----+----------+--------------------+--------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_cpt_procedures_augmented|
|Compatibility:|Healthcare NLP 3.0.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[cpt_code_aug]|
|Language:|en|
|Case sensitive:|false|
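The `all_k_codes` and `all_k_resolutions` columns in the results above are `:::`-separated candidate lists. A short sketch of pairing them up (cell values here are illustrative, since the table truncates them):

```python
def parse_top_k(all_k_codes, all_k_resolutions, sep=":::"):
    """Pair each candidate code with its resolution text."""
    return list(zip(all_k_codes.split(sep), all_k_resolutions.split(sep)))

# illustrative values modeled on the truncated table cells
codes = "99135:::99185:::99186"
resolutions = "Induced hypotension:::Hypothermia, regional:::Hypothermia, total body"
print(parse_top_k(codes, resolutions)[0])  # ('99135', 'Induced hypotension')
```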
---
layout: model
title: Sentence Entity Resolver for RxCUI (``sbiobert_base_cased_mli`` embeddings)
author: John Snow Labs
name: sbiobertresolve_rxcui
language: en
nav_key: models
repository: clinical/models
date: 2020-12-11
task: Entity Resolution
edition: Healthcare NLP 2.6.5
spark_version: 2.4
tags: [clinical,entity_resolution,en]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model maps extracted medical entities to RxCUI codes using chunk embeddings.
{:.h2_title}
## Predicted Entities
RxCUI Codes and their normalized definition with ``sbiobert_base_cased_mli`` embeddings.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxcui_en_2.6.4_2.4_1607714146277.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxcui_en_2.6.4_2.4_1607714146277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
The ```sbiobertresolve_rxcui``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings and ```ner_posology``` as the NER model, with ```DRUG``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
rxcui_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_rxcui","en", "clinical/models") \
.setInputCols(["ner_chunk", "sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, rxcui_resolver])
data = spark.createDataFrame([["He was seen by the endocrinology service and she was discharged on 50 mg of eltrombopag oral at night, 5 mg amlodipine with meals, and metformin 1000 mg two times a day"]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
...
val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
val rxcui_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_rxcui","en", "clinical/models")
.setInputCols(Array("ner_chunk", "sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, rxcui_resolver))
val data = Seq("He was seen by the endocrinology service and she was discharged on 50 mg of eltrombopag oral at night, 5 mg amlodipine with meals, and metformin 1000 mg two times a day").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
```bash
+---------------------------+--------+-----------------------------------------------------+
| chunk | code | term |
+---------------------------+--------+-----------------------------------------------------+
| 50 mg of eltrombopag oral | 825427 | eltrombopag 50 MG Oral Tablet |
| 5 mg amlodipine | 197361 | amlodipine 5 MG Oral Tablet |
| metformin 1000 mg | 861004 | metformin hydrochloride 2000 MG Oral Tablet |
+---------------------------+--------+-----------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---------------|---------------------|
| Name: | sbiobertresolve_rxcui |
| Type: | SentenceEntityResolverModel |
| Compatibility: | Spark NLP 2.6.5 + |
| License: | Licensed |
| Edition: | Official |
|Input labels: | [ner_chunk, chunk_embeddings] |
|Output labels: | [resolution] |
| Language: | en |
| Dependencies: | sbiobert_base_cased_mli |
{:.h2_title}
## Data Source
Trained on November 2020 RxNorm Clinical Drugs ontology graph with ``sbiobert_base_cased_mli`` embeddings.
https://www.nlm.nih.gov/pubs/techbull/nd20/brief/nd20_rx_norm_november_release.html.
[Sample Content](https://rxnav.nlm.nih.gov/REST/rxclass/class/byRxcui.json?rxcui=1000000).
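With `setDistanceFunction("EUCLIDEAN")`, the resolver returns the code whose reference embedding lies closest to the chunk's sentence embedding. A toy sketch of that lookup (3-dim vectors standing in for the much higher-dimensional sentence-BERT embeddings; codes borrowed from the results above):

```python
import math

def resolve(chunk_vec, code_vectors):
    """Return the code whose reference embedding is nearest in Euclidean distance."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(code_vectors, key=lambda code: dist(chunk_vec, code_vectors[code]))

# toy 3-dim embeddings; real sentence embeddings are much larger
code_vectors = {"825427": [0.9, 0.1, 0.0], "197361": [0.1, 0.9, 0.0]}
print(resolve([0.8, 0.2, 0.1], code_vectors))  # 825427
```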
---
layout: model
title: Part of Speech for Turkish
author: John Snow Labs
name: pos_ud_imst
date: 2021-03-08
tags: [part_of_speech, open_source, turkish, pos_ud_imst, tr]
task: Part of Speech Tagging
language: tr
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`.
## Predicted Entities
- ADJ
- PROPN
- PUNCT
- ADP
- NOUN
- VERB
- PRON
- ADV
- NUM
- AUX
- CCONJ
- DET
- INTJ
- X
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_imst_tr_3.0.0_3.0_1615230214154.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_imst_tr_3.0.0_3.0_1615230214154.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ud_imst", "tr") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([["John Snow Labs'tan merhaba! "]], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_imst", "tr")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("John Snow Labs'tan merhaba! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["John Snow Labs'tan merhaba! "]
token_df = nlu.load('tr.pos.ud_imst').predict(text)
token_df
```
## Results
```bash
token pos
0 John NOUN
1 Snow PROPN
2 Labs'tan PROPN
3 merhaba NOUN
4 ! PUNCT
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_imst|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|tr|
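The `averaged perceptron architecture` mentioned in the description learns per-(feature, tag) weights by promoting the gold tag and demoting the wrong guess. A minimal sketch (weight averaging omitted for brevity; the feature set and tags are illustrative, not the model's actual ones):

```python
from collections import defaultdict

class TinyPerceptronTagger:
    """Minimal perceptron POS tagger: a tag's score is the sum of feature weights."""
    def __init__(self, tags):
        self.tags = tags
        self.weights = defaultdict(float)  # (feature, tag) -> weight

    def features(self, word):
        return ["word=" + word.lower(), "suffix=" + word[-3:]]

    def predict(self, word):
        feats = self.features(word)
        return max(self.tags, key=lambda t: sum(self.weights[(f, t)] for f in feats))

    def update(self, word, gold):
        guess = self.predict(word)
        if guess != gold:  # reward the gold tag, penalize the wrong guess
            for f in self.features(word):
                self.weights[(f, gold)] += 1.0
                self.weights[(f, guess)] -= 1.0

tagger = TinyPerceptronTagger(["NOUN", "PUNCT"])
for _ in range(3):
    tagger.update("merhaba", "NOUN")
    tagger.update("!", "PUNCT")
print(tagger.predict("merhaba"), tagger.predict("!"))  # NOUN PUNCT
```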
---
layout: model
title: English asr_wav2vec2_large_960h_lv60_self TFWav2Vec2ForCTC from facebook
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_960h_lv60_self
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h_lv60_self` is an English model originally trained by facebook.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_960h_lv60_self_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_960h_lv60_self_en_4.2.0_3.0_1664037014758.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_960h_lv60_self_en_4.2.0_3.0_1664037014758.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_960h_lv60_self', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_960h_lv60_self", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_960h_lv60_self|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|757.3 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
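The `AudioAssembler` stage at the head of this pipeline expects each row to carry the audio as an array of floats. A stdlib-only sketch of producing such an array (assuming 16-bit mono PCM WAV input; the `audio_content` column name is a common convention, check your pipeline's expected schema):

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes):
    """Decode 16-bit PCM WAV bytes into floats in [-1.0, 1.0]."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        frames = w.readframes(w.getnframes())
        samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# build a tiny in-memory WAV just to demonstrate the round trip
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)   # mono
    w.setsampwidth(2)   # 16-bit samples
    w.setframerate(16000)
    w.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

floats = wav_to_floats(buf.getvalue())
print(floats[:2])  # [0.0, 0.5]
```

The resulting list can then be put into a DataFrame column (for example `spark.createDataFrame([(floats,)], ["audio_content"])`) before calling the pipeline.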
---
layout: model
title: English DistilBertForTokenClassification Cased model (from Lucifermorningstar011)
author: John Snow Labs
name: distilbert_token_classifier_autotrain_final_784824218
date: 2023-03-03
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-final-784824218` is an English model originally trained by `Lucifermorningstar011`.
## Predicted Entities
`9`, `0`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824218_en_4.3.1_3.0_1677881805603.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824218_en_4.3.1_3.0_1677881805603.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824218","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824218","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_autotrain_final_784824218|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/Lucifermorningstar011/autotrain-final-784824218
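Token classifiers like this one emit one label per token; downstream, a converter groups BIO-tagged tokens into entity chunks. A minimal sketch of that grouping (this particular model's labels are `9`/`0`, so the standard BIO tag names below are purely illustrative):

```python
def bio_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (entity_type, text) chunks."""
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new chunk begins
            if current:
                chunks.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)        # continue the open chunk
        else:                             # "O" or inconsistent tag closes the chunk
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(toks)) for label, toks in chunks]

tokens = ["John", "Snow", "lives", "in", "London"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(bio_to_chunks(tokens, tags))  # [('PER', 'John Snow'), ('LOC', 'London')]
```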
---
layout: model
title: Part of Speech for Romanian
author: John Snow Labs
name: pos_ud_rrt
date: 2020-05-04 23:32:00 +0800
task: Part of Speech Tagging
language: ro
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [pos, ro]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_rrt_ro_2.5.0_2.4_1588622539956.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_rrt_ro_2.5.0_2.4_1588622539956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
pos = PerceptronModel.pretrained("pos_ud_rrt", "ro") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("În afară de a fi regele nordului, John Snow este un medic englez și un lider în dezvoltarea anesteziei și igienei medicale.")
```
```scala
...
val pos = PerceptronModel.pretrained("pos_ud_rrt", "ro")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("În afară de a fi regele nordului, John Snow este un medic englez și un lider în dezvoltarea anesteziei și igienei medicale.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""În afară de a fi regele nordului, John Snow este un medic englez și un lider în dezvoltarea anesteziei și igienei medicale."""]
pos_df = nlu.load('ro.pos.ud_rrt').predict(text, output_level='token')
pos_df
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='pos', begin=0, end=1, result='ADP', metadata={'word': 'În'}),
Row(annotatorType='pos', begin=3, end=7, result='ADV', metadata={'word': 'afară'}),
Row(annotatorType='pos', begin=9, end=10, result='ADP', metadata={'word': 'de'}),
Row(annotatorType='pos', begin=12, end=12, result='PART', metadata={'word': 'a'}),
Row(annotatorType='pos', begin=14, end=15, result='AUX', metadata={'word': 'fi'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_rrt|
|Type:|pos|
|Compatibility:|Spark NLP 2.5.0+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[pos]|
|Language:|ro|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: Chinese BertForMaskedLM Cased model (from hfl)
author: John Snow Labs
name: bert_embeddings_rbt4_h312
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rbt4-h312` is a Chinese model originally trained by `hfl`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt4_h312_zh_4.2.4_3.0_1670327101139.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt4_h312_zh_4.2.4_3.0_1670327101139.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt4_h312","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt4_h312","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_rbt4_h312|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|43.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/hfl/rbt4-h312
- https://github.com/iflytek/MiniRBT
- https://github.com/ymcui/LERT
- https://github.com/ymcui/PERT
- https://github.com/ymcui/MacBERT
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/iflytek/HFL-Anthology
---
layout: model
title: English DistilBertForQuestionAnswering Base Cased model (from nlpunibo)
author: John Snow Labs
name: distilbert_qa_base_config1
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert_base_config1` is an English model originally trained by `nlpunibo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config1_en_4.3.0_3.0_1672774413705.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config1_en_4.3.0_3.0_1672774413705.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config1","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_config1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/nlpunibo/distilbert_base_config1
---
layout: model
title: Extract Entities in Spanish Clinical Trial Abstracts (BertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_ner_clinical_trials_abstracts
date: 2022-08-11
tags: [es, clinical, licensed, token_classification, bert, ner]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 4.0.2
spark_version: 3.0
supported: true
annotator: MedicalBertForTokenClassifier
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Named Entity Recognition model is intended for detecting relevant entities from Spanish clinical trial abstracts and trained using the BertForTokenClassification method from the transformers library and [BERT based](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) embeddings.
The model detects Pharmacological and Chemical Substances (CHEM), pathologies (DISO), and lab tests and diagnostic or therapeutic procedures (PROC).
## Predicted Entities
`CHEM`, `DISO`, `PROC`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_trials_abstracts_es_4.0.2_3.0_1660229117151.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_trials_abstracts_es_4.0.2_3.0_1660229117151.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical_trials_abstracts", "es", "clinical/models")\
.setInputCols("token", "sentence")\
.setOutputCol("label")\
.setCaseSensitive(True)
ner_converter = NerConverter()\
.setInputCols(["sentence","token","label"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
tokenClassifier,
ner_converter])
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
data = spark.createDataFrame(["""Efecto de la suplementación con ácido fólico sobre los niveles de homocisteína total en pacientes en hemodiálisis. La hiperhomocisteinemia es un marcador de riesgo independiente de morbimortalidad cardiovascular. Hemos prospectivamente reducir los niveles de homocisteína total (tHcy) mediante suplemento con ácido fólico y vitamina B6 (pp), valorando su posible correlación con dosis de diálisis, función residual y parámetros nutricionales."""], StringType()).toDF("text")
result = model.transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical_trials_abstracts", "es", "clinical/models")
.setInputCols(Array("token", "sentence"))
.setOutputCol("label")
.setCaseSensitive(true)
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","label"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
tokenClassifier,
ner_converter))
val data = Seq(Array("""Efecto de la suplementación con ácido fólico sobre los niveles de homocisteína total en pacientes en hemodiálisis. La hiperhomocisteinemia es un marcador de riesgo independiente de morbimortalidad cardiovascular. Hemos prospectivamente reducir los niveles de homocisteína total (tHcy) mediante suplemento con ácido fólico y vitamina B6 (pp), valorando su posible correlación con dosis de diálisis, función residual y parámetros nutricionales.""")).toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.classify.bert_token.clinical_trials_abstract").predict("""Efecto de la suplementación con ácido fólico sobre los niveles de homocisteína total en pacientes en hemodiálisis. La hiperhomocisteinemia es un marcador de riesgo independiente de morbimortalidad cardiovascular. Hemos prospectivamente reducir los niveles de homocisteína total (tHcy) mediante suplemento con ácido fólico y vitamina B6 (pp), valorando su posible correlación con dosis de diálisis, función residual y parámetros nutricionales.""")
```
## Results
```bash
+-----------------------+---------+
|chunk |ner_label|
+-----------------------+---------+
|suplementación |PROC |
|ácido fólico |CHEM |
|niveles de homocisteína|PROC |
|hemodiálisis |PROC |
|hiperhomocisteinemia |DISO |
|niveles de homocisteína|PROC |
|tHcy |PROC |
|ácido fólico |CHEM |
|vitamina B6 |CHEM |
|pp |CHEM |
|diálisis |PROC |
|función residual |PROC |
+-----------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_clinical_trials_abstracts|
|Compatibility:|Healthcare NLP 4.0.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|410.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
The model is prepared using the reference paper: "A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine", Leonardo Campillos-Llanos, Ana Valverde-Mateos, Adrián Capllonch-Carrión and Antonio Moreno-Sandoval. BMC Medical Informatics and Decision Making volume 21, Article number: 69 (2021)
## Benchmarking
```bash
label precision recall f1-score support
B-CHEM 0.9335 0.9314 0.9325 4944
I-CHEM 0.8210 0.8689 0.8443 1251
B-DISO 0.9406 0.9429 0.9417 5538
I-DISO 0.9071 0.9115 0.9093 5129
B-PROC 0.8850 0.9113 0.8979 5893
I-PROC 0.8711 0.8615 0.8663 7047
micro-avg 0.9010 0.9070 0.9040 29802
macro-avg 0.8930 0.9046 0.8987 29802
weighted-avg 0.9012 0.9070 0.9040 29802
```
---
layout: model
title: Estonian Legal Roberta Embeddings
author: John Snow Labs
name: roberta_base_estonian_legal
date: 2023-02-16
tags: [et, estonian, embeddings, transformer, open_source, legal, tensorflow]
task: Embeddings
language: et
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-estonian-roberta-base` is an Estonian model originally trained by `joelito`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_estonian_legal_et_4.2.4_3.0_1676577830758.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_estonian_legal_et_4.2.4_3.0_1676577830758.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_base_estonian_legal|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|et|
|Size:|416.0 MB|
|Case sensitive:|true|
## References
https://huggingface.co/joelito/legal-estonian-roberta-base
---
layout: model
title: NER Pipeline for Anatomy Entities - Voice of the Patient
author: John Snow Labs
name: ner_vop_anatomy_pipeline
date: 2023-06-09
tags: [licensed, ner, en, pipeline, vop, anatomy]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline extracts mentions of anatomical sites from health-related text in colloquial language.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_anatomy_pipeline_en_4.4.3_3.0_1686341261132.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_anatomy_pipeline_en_4.4.3_3.0_1686341261132.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_vop_anatomy_pipeline", "en", "clinical/models")
pipeline.annotate("""Ugh, I pulled a muscle in my neck from sleeping weird last night. It's like a knot in my trapezius and it hurts to turn my head.""")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_vop_anatomy_pipeline", "en", "clinical/models")
val result = pipeline.annotate("""Ugh, I pulled a muscle in my neck from sleeping weird last night. It's like a knot in my trapezius and it hurts to turn my head.""")
```
## Results
```bash
| chunk | ner_label |
|:----------|:------------|
| muscle | BodyPart |
| neck | BodyPart |
| trapezius | BodyPart |
| head | BodyPart |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_anatomy_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|791.6 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_by_samantharhay TFWav2Vec2ForCTC from samantharhay
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_samantharhay
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_samantharhay` is an English model originally trained by samantharhay.
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_samantharhay_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_samantharhay_en_4.2.0_3.0_1664102981328.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_samantharhay_en_4.2.0_3.0_1664102981328.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_samantharhay', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_samantharhay", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_samantharhay|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|354.8 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English BertForQuestionAnswering model (from niklaspm)
author: John Snow Labs
name: bert_qa_linkbert_large_finetuned_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `linkbert-large-finetuned-squad` is an English model originally trained by `niklaspm`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_linkbert_large_finetuned_squad_en_4.0.0_3.0_1654188104988.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_linkbert_large_finetuned_squad_en_4.0.0_3.0_1654188104988.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_linkbert_large_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_linkbert_large_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.link_bert.large").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_linkbert_large_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.2 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/niklaspm/linkbert-large-finetuned-squad
- https://arxiv.org/abs/2203.15827
---
layout: model
title: Translate Catalan to English Pipeline
author: John Snow Labs
name: translate_ca_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, ca, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `ca`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ca_en_xx_2.7.0_2.4_1609691497747.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ca_en_xx_2.7.0_2.4_1609691497747.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_ca_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_ca_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.ca.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_ca_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: RCT Classifier (BioBERT)
author: John Snow Labs
name: bert_sequence_classifier_rct_biobert
date: 2022-03-01
tags: [licensed, sequence_classification, bert, en, rct]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 2.4
supported: true
annotator: MedicalBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier that can classify the sections within the abstracts of scientific articles regarding randomized clinical trials (RCT).
## Predicted Entities
`BACKGROUND`, `CONCLUSIONS`, `METHODS`, `OBJECTIVE`, `RESULTS`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_rct_biobert_en_3.4.1_2.4_1646129655723.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_rct_biobert_en_3.4.1_2.4_1646129655723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
sequenceClassifier_loaded = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_rct_biobert", "en", "clinical/models")\
.setInputCols(["document", "token"])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier_loaded
])
data = spark.createDataFrame([["""Previous attempts to prevent all the unwanted postoperative responses to major surgery with an epidural hydrophilic opioid , morphine , have not succeeded . The authors ' hypothesis was that the lipophilic opioid fentanyl , infused epidurally close to the spinal-cord opioid receptors corresponding to the dermatome of the surgical incision , gives equal pain relief but attenuates postoperative hormonal and metabolic responses more effectively than does systemic fentanyl ."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_rct_biobert", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier))
val data = Seq("Previous attempts to prevent all the unwanted postoperative responses to major surgery with an epidural hydrophilic opioid , morphine , have not succeeded . The authors ' hypothesis was that the lipophilic opioid fentanyl , infused epidurally close to the spinal-cord opioid receptors corresponding to the dermatome of the surgical incision , gives equal pain relief but attenuates postoperative hormonal and metabolic responses more effectively than does systemic fentanyl .").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.clinical_trials").predict("""Previous attempts to prevent all the unwanted postoperative responses to major surgery with an epidural hydrophilic opioid , morphine , have not succeeded . The authors ' hypothesis was that the lipophilic opioid fentanyl , infused epidurally close to the spinal-cord opioid receptors corresponding to the dermatome of the surgical incision , gives equal pain relief but attenuates postoperative hormonal and metabolic responses more effectively than does systemic fentanyl .""")
```
## Results
```bash
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|text |class |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|[Previous attempts to prevent all the unwanted postoperative responses to major surgery with an epidural hydrophilic opioid , morphine , have not succeeded . The authors ' hypothesis was that the lipophilic opioid fentanyl , infused epidurally close to the spinal-cord opioid receptors corresponding to the dermatome of the surgical incision , gives equal pain relief but attenuates postoperative hormonal and metabolic responses more effectively than does systemic fentanyl .]|[BACKGROUND]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_rct_biobert|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.0 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
https://arxiv.org/abs/1710.06071
## Benchmarking
```bash
label precision recall f1-score support
BACKGROUND 0.77 0.86 0.81 2000
CONCLUSIONS 0.96 0.95 0.95 2000
METHODS 0.96 0.98 0.97 2000
OBJECTIVE 0.85 0.77 0.81 2000
RESULTS 0.98 0.95 0.96 2000
accuracy 0.9 0.9 0.9 10000
macro-avg 0.9 0.9 0.9 10000
weighted-avg 0.9 0.9 0.9 10000
```
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from SauravMaheshkar)
author: John Snow Labs
name: xlm_roberta_qa_xlm_multi_roberta_large_chaii
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-multi-roberta-large-chaii` is an English model originally trained by `SauravMaheshkar`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_multi_roberta_large_chaii_en_4.0.0_3.0_1655988703987.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_multi_roberta_large_chaii_en_4.0.0_3.0_1655988703987.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_multi_roberta_large_chaii","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlm_multi_roberta_large_chaii","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.chaii.xlm_roberta.large_multi.by_SauravMaheshkar").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_multi_roberta_large_chaii|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.9 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/SauravMaheshkar/xlm-multi-roberta-large-chaii
---
layout: model
title: Legal Sub Advisory Agreement Document Binary Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_sub_advisory_agreement_bert
date: 2022-12-18
tags: [en, legal, classification, licensed, document, bert, sub, advisory, agreement, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_sub_advisory_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `sub-advisory-agreement` or not (Binary Classification).
Unlike the Longformer-based model, this model is lighter in terms of inference time.
## Predicted Entities
`sub-advisory-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_sub_advisory_agreement_bert_en_1.0.0_3.0_1671393859780.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_sub_advisory_agreement_bert_en_1.0.0_3.0_1671393859780.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+------------------------+
|result                  |
+------------------------+
|[sub-advisory-agreement]|
|[other]                 |
|[other]                 |
|[sub-advisory-agreement]|
+------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_sub_advisory_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house + SEC documents
## Benchmarking
```bash
label precision recall f1-score support
other 0.99 0.99 0.99 204
sub-advisory-agreement 0.98 0.98 0.98 107
accuracy - - 0.99 311
macro-avg 0.99 0.99 0.99 311
weighted-avg 0.99 0.99 0.99 311
```
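The macro and weighted averages in the table follow directly from the per-class rows; a minimal plain-Python sketch (figures copied from the table above) showing how the support-weighted F1 is reconstructed:

```python
# Per-class rows from the benchmarking table above
per_class = {
    "other": {"f1": 0.99, "support": 204},
    "sub-advisory-agreement": {"f1": 0.98, "support": 107},
}

total = sum(c["support"] for c in per_class.values())  # 311 documents

# Weighted average: per-class F1 weighted by class support
weighted_f1 = sum(c["f1"] * c["support"] for c in per_class.values()) / total

print(total)                  # 311
print(round(weighted_f1, 2))  # 0.99
```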
---
layout: model
title: Recognize Entities OntoNotes - ELECTRA Large
author: John Snow Labs
name: onto_recognize_entities_electra_large
date: 2020-12-09
task: [Named Entity Recognition, Sentence Detection, Embeddings, Pipeline Public]
language: en
nav_key: models
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [en, open_source, pipeline]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A pretrained pipeline containing a NerDLModel trained on OntoNotes 5.0 with `electra_large_uncased` embeddings. It can extract the following 18 entities:
## Predicted Entities
`CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_electra_large_en_2.7.0_2.4_1607530726468.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_electra_large_en_2.7.0_2.4_1607530726468.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('onto_recognize_entities_electra_large')
result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("onto_recognize_entities_electra_large")
val result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.ner.onto.large").predict("""Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.""")
```
{:.h2_title}
## Results
```bash
+------------+---------+
|chunk |ner_label|
+------------+---------+
|Johnson |PERSON |
|first |ORDINAL |
|2001 |DATE |
|eight years |DATE |
|London |GPE |
|2008 to 2016|DATE |
+------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|onto_recognize_entities_electra_large|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|en|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- Tokenizer
- BertEmbeddings
- NerDLModel
- NerConverter
---
layout: model
title: Part of Speech for Dutch
author: John Snow Labs
name: pos_ud_alpino
date: 2021-03-08
tags: [part_of_speech, open_source, dutch, pos_ud_alpino, nl]
task: Part of Speech Tagging
language: nl
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`.
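The averaged perceptron keeps a weight per (feature, tag) pair, rewards the gold tag and penalizes the wrong guess on each mistake, and averages the weights over all updates. A toy sketch of that update rule (plain Python with made-up feature names, not the Spark NLP implementation):

```python
from collections import defaultdict

TAGS = ["NOUN", "VERB", "DET"]

# weights[feature][tag] -> score; totals accumulate for the final averaging step
weights = defaultdict(lambda: defaultdict(float))
totals = defaultdict(lambda: defaultdict(float))

def predict(features):
    scores = {t: sum(weights[f][t] for f in features) for t in TAGS}
    return max(TAGS, key=lambda t: scores[t])

def update(features, gold):
    guess = predict(features)
    if guess != gold:
        for f in features:
            weights[f][gold] += 1.0   # reward the correct tag
            weights[f][guess] -= 1.0  # penalize the wrong guess
    for f in features:
        for t in TAGS:
            totals[f][t] += weights[f][t]  # running sum for averaging

# two toy updates; the feature names are hypothetical
update(["word=de", "suffix=de"], "DET")
update(["word=loopt", "suffix=t"], "VERB")
print(predict(["word=de", "suffix=de"]))  # DET
```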
## Predicted Entities
- PRON
- AUX
- ADV
- VERB
- PUNCT
- ADP
- NUM
- NOUN
- SCONJ
- DET
- ADJ
- PROPN
- CCONJ
- SYM
- X
- INTJ
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_alpino_nl_3.0.0_3.0_1615230249057.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_alpino_nl_3.0.0_3.0_1615230249057.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ud_alpino", "nl") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['Hallo van John Snow Labs! ']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_alpino", "nl")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("Hallo van John Snow Labs! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = [""Hallo van John Snow Labs! ""]
token_df = nlu.load('nl.pos.ud_alpino').predict(text)
token_df
```
## Results
```bash
token pos
0 Hallo PROPN
1 van ADP
2 John PROPN
3 Snow PROPN
4 Labs PROPN
5 ! PUNCT
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_alpino|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|nl|
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_8_h_768
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-8_H-768` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_768_zh_4.2.4_3.0_1670021813941.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_768_zh_4.2.4_3.0_1670021813941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_768","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_768","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
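Token vectors in the `embeddings` column are typically compared with cosine similarity. A self-contained sketch with toy 4-dimensional stand-ins (this model emits 768-dimensional vectors):

```python
import math

def cosine(a, b):
    # cosine similarity: dot product over the product of L2 norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy 4-dimensional stand-ins for two token vectors
v1 = [0.2, 0.9, 0.1, 0.0]
v2 = [0.8, 0.1, 0.5, 0.3]

print(round(cosine(v1, v1), 2))  # 1.0
print(round(cosine(v1, v2), 2))
```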
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_8_h_768|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|277.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-8_H-768
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://github.com/dbiir/UER-py/
- https://cloud.tencent.com/
---
layout: model
title: English BertForQuestionAnswering Cased model (from Callmenicky)
author: John Snow Labs
name: bert_qa_callmenicky_finetuned_squad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `Callmenicky`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_callmenicky_finetuned_squad_en_4.0.0_3.0_1657185991318.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_callmenicky_finetuned_squad_en_4.0.0_3.0_1657185991318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_callmenicky_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_callmenicky_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
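Under the hood, extractive QA models score each context token as a possible answer start and end, and the returned answer is the highest-scoring valid span. A toy sketch with hypothetical logits for the example context above:

```python
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# hypothetical start/end scores for illustration only
start_logits = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
end_logits   = [0.1, 0.1, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]

best, best_score = None, float("-inf")
for i, s in enumerate(start_logits):
    for j in range(i, len(end_logits)):  # end must not precede start
        score = s + end_logits[j]
        if score > best_score:
            best, best_score = (i, j), score

answer = " ".join(tokens[best[0] : best[1] + 1])
print(answer)  # Clara
```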
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_callmenicky_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Callmenicky/bert-finetuned-squad
---
layout: model
title: Pipeline to Detect Organism in Medical Texts
author: John Snow Labs
name: bert_token_classifier_ner_linnaeus_species_pipeline
date: 2023-03-20
tags: [en, ner, clinical, licensed, bertfortokenclassification]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [bert_token_classifier_ner_linnaeus_species](https://nlp.johnsnowlabs.com/2022/07/25/bert_token_classifier_ner_linnaeus_species_en_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_linnaeus_species_pipeline_en_4.3.0_3.2_1679303734578.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_linnaeus_species_pipeline_en_4.3.0_3.2_1679303734578.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_ner_linnaeus_species_pipeline", "en", "clinical/models")
text = '''First identified in chicken, vigilin homologues have now been found in human (6), Xenopus laevis (7), Drosophila melanogaster (8) and Schizosaccharomyces pombe.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_ner_linnaeus_species_pipeline", "en", "clinical/models")
val text = "First identified in chicken, vigilin homologues have now been found in human (6), Xenopus laevis (7), Drosophila melanogaster (8) and Schizosaccharomyces pombe."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:--------------------------|--------:|------:|:------------|-------------:|
| 0 | chicken | 20 | 26 | SPECIES | 0.998697 |
| 1 | human | 71 | 75 | SPECIES | 0.999767 |
| 2 | Xenopus laevis | 82 | 95 | SPECIES | 0.999918 |
| 3 | Drosophila melanogaster | 102 | 124 | SPECIES | 0.999925 |
| 4 | Schizosaccharomyces pombe | 134 | 158 | SPECIES | 0.999881 |
```
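The `begin`/`end` columns above are 0-based character offsets into the input string, with `end` inclusive; a quick plain-Python sanity check reproduces the first two rows:

```python
text = ("First identified in chicken, vigilin homologues have now been found "
        "in human (6), Xenopus laevis (7), Drosophila melanogaster (8) "
        "and Schizosaccharomyces pombe.")

def span(mention):
    # first occurrence; end is inclusive, as in the results table
    begin = text.find(mention)
    return begin, begin + len(mention) - 1

print(span("chicken"))  # (20, 26)
print(span("human"))    # (71, 75)
```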
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_linnaeus_species_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.7 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverterInternalModel
---
layout: model
title: English BertForQuestionAnswering model (from clagator)
author: John Snow Labs
name: bert_qa_biobert_squad2_cased
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert_squad2_cased` is an English model originally trained by `clagator`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_squad2_cased_en_4.0.0_3.0_1654185692636.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_squad2_cased_en_4.0.0_3.0_1654185692636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_squad2_cased","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_biobert_squad2_cased","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.biobert.cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
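NLU packs the question and context into a single string separated by `|||`, as in the snippet above; a minimal sketch of splitting such an input back into its two fields (the helper name is hypothetical):

```python
def split_qa(packed: str):
    # NLU-style "question|||context" input
    question, _, context = packed.partition("|||")
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```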
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_biobert_squad2_cased|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|403.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/clagator/biobert_squad2_cased
---
layout: model
title: Sundanese asr_wav2vec2_large_xlsr_sundanese TFWav2Vec2ForCTC from cahya
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_sundanese
date: 2022-09-24
tags: [wav2vec2, su, audio, open_source, asr]
task: Automatic Speech Recognition
language: su
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_sundanese` is a Sundanese model originally trained by cahya.
NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use `asr_wav2vec2_large_xlsr_sundanese_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_sundanese_su_4.2.0_3.0_1664039112850.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_sundanese_su_4.2.0_3.0_1664039112850.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xlsr_sundanese", "su")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xlsr_sundanese", "su")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
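`AudioAssembler` expects each row of `audioDf` to hold the raw waveform as an array of floats, and Wav2Vec2 models are trained on 16 kHz audio. A stdlib-only sketch that builds one second of a 440 Hz test tone in that shape (turning it into `audioDf` would still require a Spark session):

```python
import math

SAMPLE_RATE = 16_000  # Hz, the rate Wav2Vec2 models expect

# one second of a 440 Hz sine wave as a flat list of floats in [-1, 1]
waveform = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
            for t in range(SAMPLE_RATE)]

print(len(waveform))                             # 16000
print(all(-1.0 <= x <= 1.0 for x in waveform))   # True
```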
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xlsr_sundanese|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|su|
|Size:|1.2 GB|
---
layout: model
title: French Bert Embeddings
author: John Snow Labs
name: bert_embeddings_bert_base_fr_cased
date: 2022-04-11
tags: [bert, embeddings, fr, open_source]
task: Embeddings
language: fr
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-fr-cased` is a French model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_fr_cased_fr_3.4.2_3.0_1649675673587.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_fr_cased_fr_3.4.2_3.0_1649675673587.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_fr_cased","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark Nlp"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_fr_cased","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark Nlp").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("fr.embed.bert_base_fr_cased").predict("""J'adore Spark Nlp""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_fr_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|fr|
|Size:|393.7 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/bert-base-fr-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Translate Latvian to English Pipeline
author: John Snow Labs
name: translate_lv_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, lv, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `lv`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_lv_en_xx_2.7.0_2.4_1609690492134.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_lv_en_xx_2.7.0_2.4_1609690492134.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_lv_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_lv_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.lv.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_lv_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Pipeline to Detect Clinical Entities
author: John Snow Labs
name: ner_jsl_greedy_biobert_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_jsl_greedy_biobert](https://nlp.johnsnowlabs.com/2021/08/13/ner_jsl_greedy_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_biobert_pipeline_en_3.4.1_3.0_1647869992577.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_biobert_pipeline_en_3.4.1_3.0_1647869992577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_jsl_greedy_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.")
```
```scala
val pipeline = new PretrainedPipeline("ner_jsl_greedy_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.biobert_jsl_greedy.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
## Results
```bash
| | chunk | entity |
|---:|:-----------------------------------------------|:-----------------------------|
| 0 | 21-day-old | Age |
| 1 | Caucasian | Race_Ethnicity |
| 2 | male | Gender |
| 3 | for 2 days | Duration |
| 4 | congestion | Symptom |
| 5 | mom | Gender |
| 6 | suctioning yellow discharge | Symptom |
| 7 | nares | External_body_part_or_region |
| 8 | she | Gender |
| 9 | mild problems with his breathing while feeding | Symptom |
| 10 | perioral cyanosis | Symptom |
| 11 | retractions | Symptom |
| 12 | One day ago | RelativeDate |
| 13 | mom | Gender |
| 14 | tactile temperature | Symptom |
| 15 | Tylenol | Drug |
| 16 | Baby | Age |
| 17 | decreased p.o. intake | Symptom |
| 18 | His | Gender |
| 19 | breast-feeding | External_body_part_or_region |
| 20 | q.2h | Frequency |
| 21 | to 5 to 10 minutes | Duration |
| 22 | his | Gender |
| 23 | respiratory congestion | Symptom |
| 24 | He | Gender |
| 25 | tired | Symptom |
| 26 | fussy | Symptom |
| 27 | over the past 2 days | RelativeDate |
| 28 | albuterol | Drug |
| 29 | ER | Clinical_Dept |
| 30 | His | Gender |
| 31 | urine output has also decreased | Symptom |
| 32 | he | Gender |
| 33 | per 24 hours | Frequency |
| 34 | he | Gender |
| 35 | per 24 hours | Frequency |
| 36 | Mom | Gender |
| 37 | diarrhea | Symptom |
| 38 | His | Gender |
| 39 | bowel | Internal_organ_or_component |
```
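A quick tally over the chunk/entity rows shows how the label distribution of a pipeline run can be summarized; a sketch over a hand-copied subset of the table above:

```python
from collections import Counter

# (chunk, entity) pairs copied from the first rows of the results table
rows = [
    ("21-day-old", "Age"),
    ("Caucasian", "Race_Ethnicity"),
    ("male", "Gender"),
    ("for 2 days", "Duration"),
    ("congestion", "Symptom"),
    ("mom", "Gender"),
    ("suctioning yellow discharge", "Symptom"),
    ("she", "Gender"),
]

counts = Counter(entity for _, entity in rows)
print(counts.most_common(2))  # [('Gender', 3), ('Symptom', 2)]
```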
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_jsl_greedy_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.7 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverter
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from microsoft)
author: John Snow Labs
name: roberta_qa_xdoc_base_squad2.0
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xdoc-base-squad2.0` is an English model originally trained by `microsoft`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_xdoc_base_squad2.0_en_4.3.0_3.0_1674224984469.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_xdoc_base_squad2.0_en_4.3.0_3.0_1674224984469.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_xdoc_base_squad2.0","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_xdoc_base_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_xdoc_base_squad2.0|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|466.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/microsoft/xdoc-base-squad2.0
- https://arxiv.org/abs/2210.02849
---
layout: model
title: English image_classifier_vit_Check_Missing_Teeth ViTForImageClassification from steven123
author: John Snow Labs
name: image_classifier_vit_Check_Missing_Teeth
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_Check_Missing_Teeth` is an English model originally trained by steven123.
## Predicted Entities
`Missing Teeth`, `Non-Missing Teeth`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Check_Missing_Teeth_en_4.1.0_3.0_1660167758083.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Check_Missing_Teeth_en_4.1.0_3.0_1660167758083.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_Check_Missing_Teeth", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_Check_Missing_Teeth", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_Check_Missing_Teeth|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Fast Neural Machine Translation Model from Luvale to English
author: John Snow Labs
name: opus_mt_lue_en
date: 2020-12-29
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, lue, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `lue`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_lue_en_xx_2.7.0_2.4_1609254447650.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_lue_en_xx_2.7.0_2.4_1609254447650.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_lue_en", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("PUT YOUR STRING HERE")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_lue_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.lue.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_lue_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Persian BertForQuestionAnswering Base Uncased model (from mhmsadegh)
author: John Snow Labs
name: bert_qa_base_parsbert_uncased_finetuned_squad
date: 2022-07-07
tags: [fa, open_source, bert, question_answering]
task: Question Answering
language: fa
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-parsbert-uncased-finetuned-squad` is a Persian model originally trained by `mhmsadegh`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_parsbert_uncased_finetuned_squad_fa_4.0.0_3.0_1657183402278.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_parsbert_uncased_finetuned_squad_fa_4.0.0_3.0_1657183402278.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_parsbert_uncased_finetuned_squad","fa") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["اسم من چیست؟", "نام من کلارا است و من در برکلی زندگی می کنم."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_parsbert_uncased_finetuned_squad","fa")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("اسم من چیست؟", "نام من کلارا است و من در برکلی زندگی می کنم.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_parsbert_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|fa|
|Size:|607.1 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/mhmsadegh/bert-base-parsbert-uncased-finetuned-squad
---
layout: model
title: Legal Modifications Clause Binary Classifier
author: John Snow Labs
name: legclf_modifications_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `modifications` clause type. To use it, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only individual sentences, not the whole text, so it is better to skip it unless you want to do binary classification at the sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
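As an illustration of that splitting step, "paragraph splitting (by multiline)" can be sketched in plain Python. This is a minimal standalone helper for breaking a long contract on blank lines, not one of the Legal NLP splitters from the tutorial:

```python
# Minimal sketch: split a long contract on blank lines so each chunk
# stays comfortably within the model's 512-token window.
def split_paragraphs(text):
    """Split on blank-line boundaries and drop empty fragments."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

contract = (
    "1. TERM. This agreement starts on the Effective Date.\n\n"
    "2. MODIFICATIONS. No amendment is valid unless made in writing."
)
chunks = split_paragraphs(contract)
# Each chunk can now be classified independently.
```

Each resulting chunk can then be fed to the classifier on its own, and the per-chunk True/False outputs aggregated afterwards.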
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing a series of True/False values, one for each clause model you add.
## Predicted Entities
`other`, `modifications`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_modifications_clause_en_1.0.0_3.2_1660123743355.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_modifications_clause_en_1.0.0_3.2_1660123743355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
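This card ships without a usage snippet, so the sketch below shows one plausible way to plug the classifier into a pipeline, matching the input/output labels listed further down (`sentence_embeddings` in, `category` out). The embeddings model name (`sent_bert_base_cased`) and the `legal.ClassifierDLModel` usage are assumptions drawn from similar John Snow Labs clause classifiers, not taken from this card, and running it requires a licensed Legal NLP environment:

```python
# Hedged sketch; embeddings model name is an assumption, not from this card.
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
doc_classifier = legal.ClassifierDLModel.pretrained("legclf_modifications_clause", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("category")
pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])
```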
## Results
```bash
+---------------+
|         result|
+---------------+
|[modifications]|
|        [other]|
|        [other]|
|[modifications]|
+---------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_modifications_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|
## References
Legal documents, scraped from the Internet and classified in-house.
## Benchmarking
```bash
label precision recall f1-score support
modifications 0.89 0.83 0.86 76
other 0.92 0.95 0.94 168
accuracy - - 0.91 244
macro-avg 0.91 0.89 0.90 244
weighted-avg 0.91 0.91 0.91 244
```
---
layout: model
title: English BERT Embeddings Cased model (from mrm8488)
author: John Snow Labs
name: bert_embeddings_bioclinicalbert_finetuned_covid_papers
date: 2022-07-15
tags: [en, open_source, bert, embeddings]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BERT Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bioclinicalBERT-finetuned-covid-papers` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bioclinicalbert_finetuned_covid_papers_en_4.0.0_3.0_1657880858798.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bioclinicalbert_finetuned_covid_papers_en_4.0.0_3.0_1657880858798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bioclinicalbert_finetuned_covid_papers","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bioclinicalbert_finetuned_covid_papers","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bioclinicalbert_finetuned_covid_papers|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|406.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/mrm8488/bioclinicalBERT-finetuned-covid-papers
---
layout: model
title: Detect Assertion Status from Response to Treatment
author: John Snow Labs
name: assertion_oncology_response_to_treatment_wip
date: 2022-10-01
tags: [licensed, clinical, oncology, en, assertion]
task: Assertion Status
language: en
nav_key: models
edition: Healthcare NLP 4.1.0
spark_version: 3.0
supported: true
annotator: AssertionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model detects the assertion status of entities related to response to treatment. The model identifies positive mentions (Present_Or_Past status), and hypothetical or absent mentions (Hypothetical_Or_Absent status).
## Predicted Entities
`Hypothetical_Or_Absent`, `Present_Or_Past`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_response_to_treatment_wip_en_4.1.0_3.0_1664641698152.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_response_to_treatment_wip_en_4.1.0_3.0_1664641698152.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(["Response_To_Treatment"])
assertion = AssertionDLModel.pretrained("assertion_oncology_response_to_treatment_wip", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
assertion])
data = spark.createDataFrame([["The patient presented no evidence of recurrence."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("Response_To_Treatment"))
val assertion = AssertionDLModel.pretrained("assertion_oncology_response_to_treatment_wip","en","clinical/models")
.setInputCols(Array("sentence","ner_chunk","embeddings"))
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
assertion))
val data = Seq("""The patient presented no evidence of recurrence.""").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.assert.oncology_response_to_treatment_wip").predict("""The patient presented no evidence of recurrence.""")
```
## Results
```bash
| chunk | ner_label | assertion |
|:-----------|:----------------------|:-----------------------|
| recurrence | Response_To_Treatment | Hypothetical_Or_Absent |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|assertion_oncology_response_to_treatment_wip|
|Compatibility:|Healthcare NLP 4.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, chunk, embeddings]|
|Output Labels:|[assertion_pred]|
|Language:|en|
|Size:|1.4 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label precision recall f1-score support
Hypothetical_Or_Absent 0.83 0.96 0.89 46.0
Present_Or_Past 0.94 0.79 0.86 43.0
macro-avg 0.89 0.87 0.87 89.0
weighted-avg 0.89 0.88 0.88 89.0
```
---
layout: model
title: English BertForQuestionAnswering model (from ZYW)
author: John Snow Labs
name: bert_qa_squad_mbert_model
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad-mbert-model` is an English model originally trained by `ZYW`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_squad_mbert_model_en_4.0.0_3.0_1654192040161.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_squad_mbert_model_en_4.0.0_3.0_1654192040161.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad_mbert_model","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_squad_mbert_model","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.multi_lingual_bert.by_ZYW").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_squad_mbert_model|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ZYW/squad-mbert-model
---
layout: model
title: Sentence Detection in Tamil Text
author: John Snow Labs
name: sentence_detector_dl
date: 2021-08-30
tags: [ta, open_source, sentence_detection]
task: Sentence Detection
language: ta
edition: Spark NLP 3.2.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_ta_3.2.0_3.0_1630337465197.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_ta_3.2.0_3.0_1630337465197.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel\
.pretrained("sentence_detector_dl", "ta") \
.setInputCols(["document"]) \
.setOutputCol("sentences")
sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL]))
sd_model.fullAnnotate("""ஆங்கில வாசிப்பு பத்திகளின் சிறந்த ஆதாரத்தைத் தேடுகிறீர்களா? நீங்கள் சரியான இடத்திற்கு வந்துவிட்டீர்கள். சமீபத்திய ஆய்வின்படி, இன்றைய இளைஞர்களிடம் படிக்கும் பழக்கம் வேகமாக குறைந்து வருகிறது. கொடுக்கப்பட்ட ஆங்கில வாசிப்பு பத்தியில் சில வினாடிகளுக்கு மேல் அவர்களால் கவனம் செலுத்த முடியாது! மேலும், அனைத்து போட்டித் தேர்வுகளிலும் வாசிப்பு ஒரு ஒருங்கிணைந்த பகுதியாகும். எனவே, உங்கள் வாசிப்புத் திறனை எவ்வாறு மேம்படுத்துவது? இந்த கேள்விக்கான பதில் உண்மையில் மற்றொரு கேள்வி: வாசிப்பு திறனின் பயன் என்ன? வாசிப்பின் முக்கிய நோக்கம் 'உணர்த்துவது'.""")
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "ta")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val pipeline = new Pipeline().setStages(Array(documenter, model))
val data = Seq("ஆங்கில வாசிப்பு பத்திகளின் சிறந்த ஆதாரத்தைத் தேடுகிறீர்களா? நீங்கள் சரியான இடத்திற்கு வந்துவிட்டீர்கள். சமீபத்திய ஆய்வின்படி, இன்றைய இளைஞர்களிடம் படிக்கும் பழக்கம் வேகமாக குறைந்து வருகிறது. கொடுக்கப்பட்ட ஆங்கில வாசிப்பு பத்தியில் சில வினாடிகளுக்கு மேல் அவர்களால் கவனம் செலுத்த முடியாது! மேலும், அனைத்து போட்டித் தேர்வுகளிலும் வாசிப்பு ஒரு ஒருங்கிணைந்த பகுதியாகும். எனவே, உங்கள் வாசிப்புத் திறனை எவ்வாறு மேம்படுத்துவது? இந்த கேள்விக்கான பதில் உண்மையில் மற்றொரு கேள்வி: வாசிப்பு திறனின் பயன் என்ன? வாசிப்பின் முக்கிய நோக்கம் 'உணர்த்துவது'.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load('ta.sentence_detector').predict("ஆங்கில வாசிப்பு பத்திகளின் சிறந்த ஆதாரத்தைத் தேடுகிறீர்களா? நீங்கள் சரியான இடத்திற்கு வந்துவிட்டீர்கள். சமீபத்திய ஆய்வின்படி, இன்றைய இளைஞர்களிடம் படிக்கும் பழக்கம் வேகமாக குறைந்து வருகிறது. கொடுக்கப்பட்ட ஆங்கில வாசிப்பு பத்தியில் சில வினாடிகளுக்கு மேல் அவர்களால் கவனம் செலுத்த முடியாது! மேலும், அனைத்து போட்டித் தேர்வுகளிலும் வாசிப்பு ஒரு ஒருங்கிணைந்த பகுதியாகும். எனவே, உங்கள் வாசிப்புத் திறனை எவ்வாறு மேம்படுத்துவது? இந்த கேள்விக்கான பதில் உண்மையில் மற்றொரு கேள்வி: வாசிப்பு திறனின் பயன் என்ன? வாசிப்பின் முக்கிய நோக்கம் 'உணர்த்துவது'.", output_level ='sentence')
```
## Results
```bash
+--------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------+
|[ஆங்கில வாசிப்பு பத்திகளின் சிறந்த ஆதாரத்தைத் தேடுகிறீர்களா?] |
|[நீங்கள் சரியான இடத்திற்கு வந்துவிட்டீர்கள்.] |
|[சமீபத்திய ஆய்வின்படி, இன்றைய இளைஞர்களிடம் படிக்கும் பழக்கம் வேகமாக குறைந்து வருகிறது.] |
|[கொடுக்கப்பட்ட ஆங்கில வாசிப்பு பத்தியில் சில வினாடிகளுக்கு மேல் அவர்களால் கவனம் செலுத்த முடியாது!]|
|[மேலும், அனைத்து போட்டித் தேர்வுகளிலும் வாசிப்பு ஒரு ஒருங்கிணைந்த பகுதியாகும்.] |
|[எனவே, உங்கள் வாசிப்புத் திறனை எவ்வாறு மேம்படுத்துவது?] |
|[இந்த கேள்விக்கான பதில் உண்மையில் மற்றொரு கேள்வி:] |
|[வாசிப்பு திறனின் பயன் என்ன?] |
|[வாசிப்பின் முக்கிய நோக்கம் 'உணர்த்துவது'.] |
+--------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sentence_detector_dl|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[sentences]|
|Language:|ta|
## Benchmarking
```bash
label Accuracy Recall Prec F1
0 0.98 1.00 0.96 0.98
```
---
layout: model
title: English BertForSequenceClassification Mini Cased model (from mrm8488)
author: John Snow Labs
name: bert_sequence_classifier_mini_finetuned_age_news_classification
date: 2022-07-13
tags: [en, open_source, bert, sequence_classification]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-mini-finetuned-age_news-classification` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_mini_finetuned_age_news_classification_en_4.0.0_3.0_1657720835247.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_mini_finetuned_age_news_classification_en_4.0.0_3.0_1657720835247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_mini_finetuned_age_news_classification","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_mini_finetuned_age_news_classification","en")
.setInputCols(Array("document", "token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_mini_finetuned_age_news_classification|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|42.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mrm8488/bert-mini-finetuned-age_news-classification
---
layout: model
title: English LongformerForQuestionAnswering model (from manishiitg) Version-2
author: John Snow Labs
name: longformer_qa_recruit_v2
date: 2022-06-26
tags: [en, open_source, longformer, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: LongformerForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `longformer-recruit-qa-v2` is an English model originally trained by `manishiitg`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_qa_recruit_v2_en_4.0.0_3.0_1656255752690.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_qa_recruit_v2_en_4.0.0_3.0_1656255752690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_recruit_v2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_recruit_v2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.longformer.v2").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
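The NLU one-liner above packs the question and the context into a single string separated by `|||`. Conceptually (a plain-Python sketch; the helper name is hypothetical, not part of the NLU API):

```python
def split_qa(payload: str) -> tuple:
    """Split an NLU-style 'question|||context' payload into its two parts."""
    question, context = payload.split("|||", 1)
    return question.strip(), context.strip()

q, c = split_qa("What is my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What is my name?
print(c)  # My name is Clara and I live in Berkeley.
```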
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|longformer_qa_recruit_v2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|556.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/manishiitg/longformer-recruit-qa-v2
---
layout: model
title: Fast Neural Machine Translation Model from English to Finnish
author: John Snow Labs
name: opus_mt_en_fi
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, fi, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `fi`
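These OPUS-MT models follow an `opus_mt_<source>_<target>` naming pattern (e.g. `opus_mt_en_fi` for this card). A tiny helper (hypothetical, plain Python) for building the pretrained model name from a language pair:

```python
def opus_model_name(source: str, target: str) -> str:
    """Build the Spark NLP pretrained name for an OPUS-MT language pair."""
    return f"opus_mt_{source}_{target}"

# English -> Finnish, as in this card
print(opus_model_name("en", "fi"))  # opus_mt_en_fi
```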
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_fi_xx_2.7.0_2.4_1609167722467.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_fi_xx_2.7.0_2.4_1609167722467.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_fi", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_fi", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.fi').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_fi|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Sentence Entity Resolver for Snomed Concepts, Body Structure Version (sbiobert_base_cased_mli)
author: John Snow Labs
name: sbiobertresolve_snomed_bodyStructure
date: 2021-07-08
tags: [snomed, en, clinical, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.1.0
spark_version: 2.4
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical (anatomical structures) entities to Snomed codes (body structure version) using sentence embeddings.
## Predicted Entities
Snomed Codes and their normalized definition with `sbiobert_base_cased_mli` embeddings.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_bodyStructure_en_3.1.0_2.4_1625732176926.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_bodyStructure_en_3.1.0_2.4_1625732176926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The ```sbiobertresolve_snomed_bodyStructure``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model. As the NER model, either use ```ner_jsl``` with ```Disease_Syndrome_Disorder, External_body_part_or_region``` set in ```.setWhiteList()```, or use ```ner_anatomy_coarse```, which needs no ```.setWhiteList()```; the chunks of the ```ner_jsl``` and ```ner_anatomy_coarse``` models can also be merged.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
jsl_sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("sbert_embeddings")
snomed_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_snomed_bodyStructure", "en", "clinical/models") \
.setInputCols(["ner_chunk", "sbert_embeddings"]) \
.setOutputCol("snomed_code")
snomed_pipelineModel = PipelineModel(
stages = [
documentAssembler,
jsl_sbert_embedder,
snomed_resolver])
snomed_lp = LightPipeline(snomed_pipelineModel)
result = snomed_lp.fullAnnotate("Amputation stump")
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("sbert_embeddings")
val snomed_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_snomed_bodyStructure", "en", "clinical/models")
.setInputCols(Array("ner_chunk", "sbert_embeddings"))
.setOutputCol("snomed_code")
val snomed_pipeline = new Pipeline().setStages(Array(document_assembler, sbert_embedder, snomed_resolver))
val snomed_pipelineModel = snomed_pipeline.fit(Seq("").toDF("text"))
val snomed_lp = new LightPipeline(snomed_pipelineModel)
val result = snomed_lp.fullAnnotate("Amputation stump")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.snomed_body_structure").predict("""Amputation stump""")
```
## Results
```bash
| | chunks | code | resolutions | all_codes | all_distances |
|---:|:-----------------|:---------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------|
| 0 | amputation stump | 38033009 | [Amputation stump, Amputation stump of upper limb, Amputation stump of left upper limb, Amputation stump of lower limb, Amputation stump of left lower limb, Amputation stump of right upper limb, Amputation stump of right lower limb, ...]| ['38033009', '771359009', '771364008', '771358001', '771367001', '771365009', '771368006', ...] | ['0.0000', '0.0773', '0.0858', '0.0863', '0.0905', '0.0911', '0.0972', ...] |
```
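The resolver returns parallel candidate lists (`all_codes` and `all_distances`, shown above). A minimal sketch (plain Python; the function name is hypothetical) of picking the closest SNOMED candidate from such a row:

```python
def best_snomed_candidate(codes, distances):
    """Return the (code, distance) pair with the smallest distance."""
    # distances arrive as strings like '0.0773'; convert before comparing
    return min(zip(codes, (float(d) for d in distances)), key=lambda p: p[1])

codes = ['38033009', '771359009', '771364008']
dists = ['0.0000', '0.0773', '0.0858']
code, dist = best_snomed_candidate(codes, dists)
print(code)  # 38033009 -- distance 0.0 means an exact match
```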
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_snomed_bodyStructure|
|Compatibility:|Healthcare NLP 3.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[snomed_code]|
|Language:|en|
|Case sensitive:|true|
## Data Source
https://www.snomed.org/
---
layout: model
title: Arabic Bert Embeddings (from bashar-talafha)
author: John Snow Labs
name: bert_embeddings_multi_dialect_bert_base_arabic
date: 2022-04-11
tags: [bert, embeddings, ar, open_source]
task: Embeddings
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `multi-dialect-bert-base-arabic` is an Arabic model originally trained by `bashar-talafha`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_multi_dialect_bert_base_arabic_ar_3.4.2_3.0_1649677978634.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_multi_dialect_bert_base_arabic_ar_3.4.2_3.0_1649677978634.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_multi_dialect_bert_base_arabic","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_multi_dialect_bert_base_arabic","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("أنا أحب شرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.embed.multi_dialect_bert_base_arabic").predict("""أنا أحب شرارة NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_multi_dialect_bert_base_arabic|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|414.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/bashar-talafha/multi-dialect-bert-base-arabic
- https://ai.mawdoo3.com/
- https://github.com/alisafaya/Arabic-BERT
- https://sites.google.com/view/nadi-shared-task
- https://github.com/mawdoo3/Multi-dialect-Arabic-BERT
---
layout: model
title: English DistilBertForQuestionAnswering model (from caiosantillo)
author: John Snow Labs
name: distilbert_qa_caiosantillo_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `caiosantillo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_caiosantillo_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725142690.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_caiosantillo_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725142690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_caiosantillo_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_caiosantillo_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_caiosantillo").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_caiosantillo_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/caiosantillo/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Fast Neural Machine Translation Model from Icelandic to English
author: John Snow Labs
name: opus_mt_is_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, is, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `is`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_is_en_xx_2.7.0_2.4_1609167041296.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_is_en_xx_2.7.0_2.4_1609167041296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_is_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_is_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.is.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_is_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForQuestionAnswering model (from bhavikardeshna)
author: John Snow Labs
name: bert_qa_multilingual_bert_base_cased_english
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `multilingual-bert-base-cased-english` is an English model originally trained by `bhavikardeshna`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_english_en_4.0.0_3.0_1654188464581.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_english_en_4.0.0_3.0_1654188464581.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_multilingual_bert_base_cased_english","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_multilingual_bert_base_cased_english","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in Liverpool and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bert.multilingual_english_tuned_base_cased.by_bhavikardeshna").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_multilingual_bert_base_cased_english|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|665.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/bhavikardeshna/multilingual-bert-base-cased-english
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_fpdm_pert_sent_0.01_squad2.0
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_roberta_pert_sent_0.01_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_pert_sent_0.01_squad2.0_en_4.3.0_3.0_1674211060807.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_pert_sent_0.01_squad2.0_en_4.3.0_3.0_1674211060807.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_pert_sent_0.01_squad2.0","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_pert_sent_0.01_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_fpdm_pert_sent_0.01_squad2.0|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|460.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AnonymousSub/fpdm_roberta_pert_sent_0.01_squad2.0
---
layout: model
title: Fast Neural Machine Translation Model from English to Xhosa
author: John Snow Labs
name: opus_mt_en_xh
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, xh, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `xh`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_xh_xx_2.7.0_2.4_1609163416968.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_xh_xx_2.7.0_2.4_1609163416968.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_xh", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_xh", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.xh').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_xh|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_dm128
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-dm128` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dm128_en_4.3.0_3.0_1675118965696.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dm128_en_4.3.0_3.0_1675118965696.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_small_dm128","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_small_dm128","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_dm128|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|37.4 MB|
## References
- https://huggingface.co/google/t5-efficient-small-dm128
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English BertForQuestionAnswering model (from AnonymousSub)
author: John Snow Labs
name: bert_qa_fpdm_triplet_bert_FT_newsqa
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_triplet_bert_FT_newsqa` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_fpdm_triplet_bert_FT_newsqa_en_4.0.0_3.0_1654187915859.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_fpdm_triplet_bert_FT_newsqa_en_4.0.0_3.0_1654187915859.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_fpdm_triplet_bert_FT_newsqa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_fpdm_triplet_bert_FT_newsqa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.news.bert.qa_fpdm_triplet_ft.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_fpdm_triplet_bert_FT_newsqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/fpdm_triplet_bert_FT_newsqa
---
layout: model
title: Detect Drug Information
author: John Snow Labs
name: ner_posology_en
date: 2020-04-15
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 2.4.2
spark_version: 2.4
tags: [ner, en, licensed, clinical]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Pretrained named entity recognition deep learning model for posology. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
{:.h2_title}
## Predicted Entities
``DOSAGE``, ``DRUG``, ``DURATION``, ``FORM``, ``FREQUENCY``, ``ROUTE``, ``STRENGTH``.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_en_2.4.4_2.4_1584452534235.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_en_2.4.4_2.4_1584452534235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %}
```python
...
embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_posology", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([['The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.']], ["text"]))
```
```scala
...
val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = NerDLModel.pretrained("ner_posology", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))
val data = Seq("The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe or add the ``"Finisher"`` to the end of your pipeline.
```bash
+--------------+---------+
|chunk |ner |
+--------------+---------+
|insulin |DRUG |
|Bactrim |DRUG |
|for 14 days |DURATION |
|Fragmin |DRUG |
|5000 units |DOSAGE |
|subcutaneously|ROUTE |
|daily |FREQUENCY|
|Xenaderm |DRUG |
|topically |ROUTE |
|b.i.d |FREQUENCY|
|Lantus |DRUG |
|40 units |DOSAGE |
|subcutaneously|ROUTE |
|at bedtime |FREQUENCY|
|OxyContin |DRUG |
|30 mg |STRENGTH |
|p.o |ROUTE |
|q.12 h |FREQUENCY|
|folic acid |DRUG |
|1 mg |STRENGTH |
+--------------+---------+
```
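As a plain-Python illustration of that token/label pairing (the `annotations` dict below is a hypothetical, hand-made stand-in for a LightPipeline `annotate()` result, not actual model output):

```python
# Hypothetical annotate()-style output for a short fragment of the example text
annotations = {
    "token": ["She", "is", "given", "Fragmin", "5000", "units", "subcutaneously", "daily"],
    "ner":   ["O",   "O",  "O",     "B-DRUG",  "B-DOSAGE", "I-DOSAGE", "B-ROUTE", "B-FREQUENCY"],
}

# Keep only tokens that carry an entity label, mirroring the chunk table above
pairs = [(tok, tag) for tok, tag in zip(annotations["token"], annotations["ner"]) if tag != "O"]
print(pairs)
```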
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_posology_en|
|Type:|ner|
|Compatibility:|Spark NLP 2.4.2|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Case sensitive:|false|
| Dependencies: | embeddings_clinical |
{:.h2_title}
## Data Source
Trained on the 2018 i2b2 dataset and FDA Drug datasets with ``embeddings_clinical``.
https://open.fda.gov/
## Benchmarking
```bash
| | label | tp | fp | fn | prec | rec | f1 |
|---:|:--------------|-------:|------:|-----:|---------:|---------:|---------:|
| 0 | B-DRUG | 2639 | 221 | 117 | 0.922727 | 0.957547 | 0.939815 |
| 1 | B-STRENGTH | 1711 | 188 | 87 | 0.901 | 0.951613 | 0.925615 |
| 2 | I-DURATION | 553 | 74 | 60 | 0.881978 | 0.902121 | 0.891935 |
| 3 | I-STRENGTH | 1927 | 239 | 176 | 0.889658 | 0.91631 | 0.902788 |
| 4 | I-FREQUENCY | 1749 | 163 | 133 | 0.914749 | 0.92933 | 0.921982 |
| 5 | B-FORM | 1028 | 109 | 84 | 0.904134 | 0.92446 | 0.914184 |
| 6 | B-DOSAGE | 323 | 71 | 81 | 0.819797 | 0.799505 | 0.809524 |
| 7 | I-DOSAGE | 173 | 89 | 82 | 0.660305 | 0.678431 | 0.669246 |
| 8 | I-DRUG | 1020 | 129 | 118 | 0.887728 | 0.896309 | 0.891998 |
| 9 | I-ROUTE | 17 | 4 | 5 | 0.809524 | 0.772727 | 0.790698 |
| 10 | B-ROUTE | 526 | 49 | 52 | 0.914783 | 0.910035 | 0.912402 |
| 11 | B-DURATION | 223 | 35 | 27 | 0.864341 | 0.892 | 0.877953 |
| 12 | B-FREQUENCY | 1170 | 90 | 54 | 0.928571 | 0.955882 | 0.942029 |
| 13 | I-FORM | 48 | 6 | 11 | 0.888889 | 0.813559 | 0.849558 |
| 14 | Macro-average | 13107 | 1467 | 1087 | 0.870585 | 0.878559 | 0.874554 |
| 15 | Micro-average | 13107 | 1467 | 1087 | 0.899341 | 0.923418 | 0.911221 |
```
---
layout: model
title: Translate English to Pedi Pipeline
author: John Snow Labs
name: translate_en_nso
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, nso, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `nso`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_nso_xx_2.7.0_2.4_1609687739962.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_nso_xx_2.7.0_2.4_1609687739962.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_nso", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_nso", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.nso').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_nso|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering model (from andi611) Squad2 with Neg, Multi, Repeat
author: John Snow Labs
name: distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi_with_repeat
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-squad2-with-ner-with-neg-with-multi-with-repeat` is an English model originally trained by `andi611`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi_with_repeat_en_4.0.0_3.0_1654727440196.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi_with_repeat_en_4.0.0_3.0_1654727440196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi_with_repeat","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi_with_repeat","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_conll.distil_bert.base_uncased_with_neg_with_multi_with_repeat.by_andi611").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi_with_repeat|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/andi611/distilbert-base-uncased-squad2-with-ner-with-neg-with-multi-with-repeat
---
layout: model
title: English DistilBertForQuestionAnswering Small model (from ncduy)
author: John Snow Labs
name: distilbert_qa_base_cased_distilled_squad_finetuned_squad_small
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-finetuned-squad-small` is an English model originally trained by `ncduy`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_distilled_squad_finetuned_squad_small_en_4.0.0_3.0_1654723606084.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_distilled_squad_finetuned_squad_small_en_4.0.0_3.0_1654723606084.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_distilled_squad_finetuned_squad_small","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_distilled_squad_finetuned_squad_small","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_small_cased").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_cased_distilled_squad_finetuned_squad_small|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ncduy/distilbert-base-cased-distilled-squad-finetuned-squad-small
---
layout: model
title: Multilabel Classification of Customer Service (Linguistic features)
author: John Snow Labs
name: finmulticlf_customer_service_lin_features
date: 2023-02-03
tags: [en, licensed, finance, classification, customer, linguistic, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MultiClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a Multilabel Text Classification model that can help you classify a chat message from customer service according to linguistic features. The classes are the following:
- Q - Colloquial variation
- P - Politeness variation
- W - Offensive language
- K - Keyword language
- B - Basic syntactic structure
- C - Coordinated syntactic structure
- I - Interrogative structure
- M - Morphological variation (plurals, tenses…)
- L - Lexical variation (synonyms)
- E - Expanded abbreviations (I'm -> I am, I'd -> I would…)
- N - Negation
- Z - Noise phenomena like spelling or punctuation errors
## Predicted Entities
`B`, `C`, `E`, `I`, `K`, `L`, `M`, `N`, `P`, `Q`, `W`, `Z`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmulticlf_customer_service_lin_features_en_1.0.0_3.0_1675430237309.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmulticlf_customer_service_lin_features_en_1.0.0_3.0_1675430237309.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = nlp.UniversalSentenceEncoder.pretrained() \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
docClassifier = nlp.MultiClassifierDLModel.pretrained("finmulticlf_customer_service_lin_features", "en", "finance/models")\
.setInputCols("sentence_embeddings") \
.setOutputCol("class")
pipeline = nlp.Pipeline().setStages(
[
document_assembler,
embeddings,
docClassifier
]
)
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(empty_data)
light_model = nlp.LightPipeline(model)
result = light_model.annotate("""What do i have to ddo to cancel a Gold account""")
result["class"]
```
## Results
```bash
['Q', 'B', 'L', 'Z', 'I']
```
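To make those single-letter codes readable, a small plain-Python mapping (taken directly from the class list in the description above) can be applied to the predicted labels:

```python
# Class codes and their descriptions, as listed in the model description
label_names = {
    "Q": "Colloquial variation", "P": "Politeness variation", "W": "Offensive language",
    "K": "Keyword language", "B": "Basic syntactic structure",
    "C": "Coordinated syntactic structure", "I": "Interrogative structure",
    "M": "Morphological variation", "L": "Lexical variation",
    "E": "Expanded abbreviations", "N": "Negation", "Z": "Noise phenomena",
}

predicted = ["Q", "B", "L", "Z", "I"]  # the example output from the Results section
readable = [label_names[code] for code in predicted]
print(readable)
```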
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finmulticlf_customer_service_lin_features|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|13.0 MB|
## References
https://github.com/bitext/customer-support-intent-detection-training-dataset
## Benchmarking
```bash
label precision recall f1-score support
B 1.00 1.00 1.00 485
C 0.79 0.80 0.80 61
E 0.74 0.89 0.80 44
I 0.95 0.94 0.94 134
K 0.96 0.96 0.96 108
L 0.96 0.97 0.96 402
M 0.93 0.93 0.93 134
N 0.90 0.75 0.82 12
P 0.77 0.90 0.83 30
Q 0.73 0.68 0.71 212
W 0.85 0.88 0.87 33
Z 0.68 0.72 0.70 160
micro-avg 0.90 0.90 0.90 1815
macro-avg 0.85 0.87 0.86 1815
weighted-avg 0.90 0.90 0.90 1815
samples-avg 0.91 0.92 0.90 1815
```
---
layout: model
title: Fast Neural Machine Translation Model from English to Indonesian
author: John Snow Labs
name: opus_mt_en_id
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, id, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `en`
- target languages: `id`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_id_xx_2.7.0_2.4_1609164704175.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_id_xx_2.7.0_2.4_1609164704175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_id", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_id", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.id').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_id|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Stopwords Remover for Norwegian Bokmål language (211 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, nb, open_source]
task: Stop Words Removal
language: nb
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_nb_3.4.1_3.0_1646673078011.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_nb_3.4.1_3.0_1646673078011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","nb") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Bortsett fra å være kongen av nord, er John Snow en engelsk lege og en leder i utviklingen av anestesi og medisinsk hygiene."]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","nb")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("Bortsett fra å være kongen av nord, er John Snow en engelsk lege og en leder i utviklingen av anestesi og medisinsk hygiene.").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("nb.stopwords").predict("""Bortsett fra å være kongen av nord, er John Snow en engelsk lege og en leder i utviklingen av anestesi og medisinsk hygiene.""")
```
## Results
```bash
+----------------------------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------------------------+
|[Bortsett, kongen, nord, ,, John, Snow, engelsk, lege, utviklingen, anestesi, medisinsk, hygiene, .]|
+----------------------------------------------------------------------------------------------------+
```
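Conceptually, the cleaner simply drops any token found in the stopword list; a plain-Python sketch of the same filtering (the mini stopword set below is illustrative, not the model's full 211-entry ISO list):

```python
# A few Norwegian Bokmål stopwords for illustration (not the full ISO list)
stopwords = {"fra", "å", "være", "av", "er", "en", "og", "i"}

tokens = ["Bortsett", "fra", "å", "være", "kongen", "av", "nord"]
clean = [t for t in tokens if t.lower() not in stopwords]
print(clean)  # ['Bortsett', 'kongen', 'nord']
```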
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|nb|
|Size:|1.9 KB|
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from microsoft)
author: John Snow Labs
name: roberta_qa_xdoc_base_squad1.1
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xdoc-base-squad1.1` is an English model originally trained by `microsoft`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_xdoc_base_squad1.1_en_4.3.0_3.0_1674224925472.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_xdoc_base_squad1.1_en_4.3.0_3.0_1674224925472.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
questionAnswering = RoBertaForQuestionAnswering.pretrained("roberta_qa_xdoc_base_squad1.1","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, questionAnswering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_xdoc_base_squad1.1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_xdoc_base_squad1.1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|466.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/microsoft/xdoc-base-squad1.1
- https://arxiv.org/abs/2210.02849
---
layout: model
title: Malay T5ForConditionalGeneration Tiny Cased model (from mesolitica)
author: John Snow Labs
name: t5_super_tiny_bahasa_cased
date: 2023-01-31
tags: [ms, open_source, t5, tensorflow]
task: Text Generation
language: ms
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-super-tiny-bahasa-cased` is a Malay model originally trained by `mesolitica`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_super_tiny_bahasa_cased_ms_4.3.0_3.0_1675156057502.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_super_tiny_bahasa_cased_ms_4.3.0_3.0_1675156057502.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_super_tiny_bahasa_cased","ms") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_super_tiny_bahasa_cased","ms")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_super_tiny_bahasa_cased|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|ms|
|Size:|40.5 MB|
## References
- https://huggingface.co/mesolitica/t5-super-tiny-bahasa-cased
- https://github.com/huseinzol05/malaya/tree/master/pretrained-model/t5/prepare
- https://github.com/google-research/text-to-text-transfer-transformer
- https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5
---
layout: model
title: Legal Arguments Mining in Court Decisions (in German)
author: John Snow Labs
name: legclf_argument_mining_german
date: 2023-03-26
tags: [de, licensed, classification, legal, tensorflow]
task: Text Classification
language: de
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a multiclass classification model for German which classifies arguments in legal discourse into the following classes: `subsumption`, `definition`, `conclusion`, `other`.
## Predicted Entities
`subsumption`, `definition`, `conclusion`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_argument_mining_german_de_1.0.0_3.0_1679848514704.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_argument_mining_german_de_1.0.0_3.0_1679848514704.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_large_german_legal", "de")\
.setInputCols(["document", "token"])\
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)
embeddingsSentence = nlp.SentenceEmbeddings()\
.setInputCols(["document", "embeddings"])\
.setOutputCol("sentence_embeddings")\
.setPoolingStrategy("AVERAGE")
docClassifier = legal.ClassifierDLModel.pretrained("legclf_argument_mining_german", "de", "legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
tokenizer,
embeddings,
embeddingsSentence,
docClassifier
])
df = spark.createDataFrame([["Folglich liegt eine Verletzung von Artikel 8 der Konvention vor ."]]).toDF("text")
model = nlpPipeline.fit(df)
result = model.transform(df)
result.select("text", "category.result").show(truncate=False)
```
## Results
```bash
+-----------------------------------------------------------------+------------+
|text |result |
+-----------------------------------------------------------------+------------+
|Folglich liegt eine Verletzung von Artikel 8 der Konvention vor .|[conclusion]|
+-----------------------------------------------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_argument_mining_german|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|de|
|Size:|24.0 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/MeilingShi/legal_argument_mining)
## Benchmarking
```bash
label precision recall f1-score support
conclusion 0.88 0.88 0.88 52
definition 0.83 0.83 0.83 58
other 0.86 0.88 0.87 49
subsumption 0.81 0.80 0.80 64
accuracy - - 0.84 223
macro avg 0.85 0.85 0.85 223
weighted avg 0.84 0.84 0.84 223
```
---
layout: model
title: Match Datetime in Texts
author: John Snow Labs
name: match_datetime
date: 2022-01-04
tags: [en, open_source]
task: Pipeline Public
language: en
nav_key: models
edition: Spark NLP 3.3.4
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A pretrained pipeline that matches date and time expressions in text and maps them to normalized dates in the yyyy/MM/dd format.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/match_datetime_en_3.3.4_3.0_1641310187437.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/match_datetime_en_3.3.4_3.0_1641310187437.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline_local = PretrainedPipeline('match_datetime')

text = """David visited the restaurant yesterday with his family.
He also visited and the day before, but at that time he was alone.
David again visited today with his colleagues.
He and his friends really liked the food and hoped to visit again tomorrow."""

tres = pipeline_local.fullAnnotate(text)[0]
for dte in tres['date']:
    sent = tres['sentence'][int(dte.metadata['sentence'])]
    print(f"text/chunk {sent.result[dte.begin:dte.end+1]} | mapped_date: {dte.result}")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

SparkNLP.version()

val testData = spark.createDataFrame(Seq(
(1, "David visited the restaurant yesterday with his family. He also visited and the day before, but at that time he was alone. David again visited today with his colleagues. He and his friends really liked the food and hoped to visit again tomorrow.")
)).toDF("id", "text")
val pipeline = PretrainedPipeline("match_datetime", lang="en")
val annotation = pipeline.transform(testData)
annotation.show()
```
## Results
```bash
text/chunk yesterday | mapped_date: 2022/01/02
text/chunk day before | mapped_date: 2022/01/02
text/chunk today | mapped_date: 2022/01/03
text/chunk tomorrow | mapped_date: 2022/01/04
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|match_datetime|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.3.4+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|12.9 KB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- MultiDateMatcher
---
layout: model
title: Word2Vec Embeddings in Romansh (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, rm, open_source]
task: Embeddings
language: rm
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_rm_3.4.1_3.0_1647454087483.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_rm_3.4.1_3.0_1647454087483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","rm") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","rm")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("rm.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|rm|
|Size:|64.6 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: DistilRoBERTa Base Ontonotes NER Pipeline
author: ahmedlone127
name: distilroberta_base_token_classifier_ontonotes_pipeline
date: 2022-06-14
tags: [open_source, ner, token_classifier, distilroberta, ontonotes, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: false
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [distilroberta_base_token_classifier_ontonotes](https://nlp.johnsnowlabs.com/2021/09/26/distilroberta_base_token_classifier_ontonotes_en.html) model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/distilroberta_base_token_classifier_ontonotes_pipeline_en_4.0.0_3.0_1655219463122.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/distilroberta_base_token_classifier_ontonotes_pipeline_en_4.0.0_3.0_1655219463122.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("distilroberta_base_token_classifier_ontonotes_pipeline", lang = "en")
pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.")
```
```scala
val pipeline = new PretrainedPipeline("distilroberta_base_token_classifier_ontonotes_pipeline", lang = "en")
pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|John |PERSON |
|John Snow Labs|ORG |
|November 2020 |DATE |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilroberta_base_token_classifier_ontonotes_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Community|
|Language:|en|
|Size:|307.5 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- RoBertaForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: Legal Now therefore Clause Binary Classifier
author: John Snow Labs
name: legclf_now_therefore_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `now-therefore` clause type. To use this model, make sure you provide enough context as an input. Adding a sentence splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do binary classification at sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you have added.
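As a rough illustration of the first splitting technique above (this sketch is ours, not taken from the tutorial), paragraph splitting by multiline can be as simple as splitting on blank lines:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on one or more blank lines."""
    # A blank line (possibly containing only whitespace) separates paragraphs.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = "NOW, THEREFORE, the parties agree as follows.\n\n1. Definitions.\n\n2. Term."
print(split_paragraphs(doc))
# → ['NOW, THEREFORE, the parties agree as follows.', '1. Definitions.', '2. Term.']
```

Each resulting paragraph can then be fed to the classifier as a separate row of the input DataFrame.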
## Predicted Entities
`other`, `now-therefore`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_now_therefore_clause_en_1.0.0_3.2_1660122766408.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_now_therefore_clause_en_1.0.0_3.2_1660122766408.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+---------------+
|         result|
+---------------+
|[now-therefore]|
|        [other]|
|        [other]|
|[now-therefore]|
+---------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_now_therefore_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
now-therefore 0.98 0.98 0.98 58
other 0.99 0.99 0.99 146
accuracy - - 0.99 204
macro-avg 0.99 0.99 0.99 204
weighted-avg 0.99 0.99 0.99 204
```
---
layout: model
title: Chamorro RobertaForQuestionAnswering (from Gantenbein)
author: John Snow Labs
name: roberta_qa_ADDI_CH_RoBERTa
date: 2022-06-20
tags: [open_source, question_answering, roberta]
task: Question Answering
language: ch
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ADDI-CH-RoBERTa` is a Chamorro model originally trained by `Gantenbein`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_CH_RoBERTa_ch_4.0.0_3.0_1655726262986.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_CH_RoBERTa_ch_4.0.0_3.0_1655726262986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_ADDI_CH_RoBERTa","ch") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_ADDI_CH_RoBERTa","ch")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("ch.answer_question.roberta.ch_tuned.by_Gantenbein").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_ADDI_CH_RoBERTa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|ch|
|Size:|421.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Gantenbein/ADDI-CH-RoBERTa
---
layout: model
title: Swedish BERT Sentence Base Cased Embedding
author: John Snow Labs
name: sent_bert_base_cased
date: 2021-09-06
tags: [swedish, bert_sentence_embeddings, open_source, cased, sv]
task: Embeddings
language: sv
edition: Spark NLP 3.2.2
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20 GB of text (200M sentences, 3,000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums), aiming to provide a representative BERT model for Swedish text.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_base_cased_sv_3.2.2_3.0_1630926268941.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_base_cased_sv_3.2.2_3.0_1630926268941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "sv") \
.setInputCols("sentence") \
.setOutputCol("bert_sentence")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings])
```
```scala
val document_assembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentence_detector = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "sv")
.setInputCols("sentence")
.setOutputCol("bert_sentence")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings))
```
{:.nlu-block}
```python
import nlu
nlu.load("sv.embed_sentence.bert.base_cased").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_bert_base_cased|
|Compatibility:|Spark NLP 3.2.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[bert_sentence]|
|Language:|sv|
|Case sensitive:|true|
## Data Source
The model is imported from: https://huggingface.co/KB/bert-base-swedish-cased
---
layout: model
title: ALBERT Large CoNNL-03 NER Pipeline
author: ahmedlone127
name: albert_large_token_classifier_conll03_pipeline
date: 2022-06-14
tags: [open_source, ner, token_classifier, albert, conll03, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: false
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [albert_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/albert_large_token_classifier_conll03_en.html) model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/albert_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655211084220.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/albert_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655211084220.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("albert_large_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")
```
```scala
val pipeline = new PretrainedPipeline("albert_large_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|John |PER |
|John Snow Labs|ORG |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_large_token_classifier_conll03_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Community|
|Language:|en|
|Size:|64.4 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- AlbertForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: Pipeline to Detect Drugs - Generalized Single Entity
author: John Snow Labs
name: ner_drugs_greedy_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, drug, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_drugs_greedy](https://nlp.johnsnowlabs.com/2021/03/31/ner_drugs_greedy_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugs_greedy_pipeline_en_3.4.1_3.0_1647873160931.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugs_greedy_pipeline_en_3.4.1_3.0_1647873160931.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_drugs_greedy_pipeline", "en", "clinical/models")
pipeline.annotate("DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated.")
```
```scala
val pipeline = new PretrainedPipeline("ner_drugs_greedy_pipeline", "en", "clinical/models")
pipeline.annotate("DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.drugs_greedy.pipeline").predict("""DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated.""")
```
## Results
```bash
+-----------------------------------+------------+
| chunk | ner_label |
+-----------------------------------+------------+
| hydrocortisone tablets | DRUG |
| 20 mg to 240 mg of hydrocortisone | DRUG |
+-----------------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_drugs_greedy_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: Legal Reinstatement Clause Binary Classifier
author: John Snow Labs
name: legclf_reinstatement_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `reinstatement` clause type. To use this model, make sure you provide enough context as an input. Adding a sentence splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do binary classification at sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `reinstatement`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_reinstatement_clause_en_1.0.0_3.2_1660122901279.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_reinstatement_clause_en_1.0.0_3.2_1660122901279.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+---------------+
|         result|
+---------------+
|[reinstatement]|
|        [other]|
|        [other]|
|[reinstatement]|
+---------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_reinstatement_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.95 0.97 0.96 101
reinstatement 0.89 0.83 0.86 29
accuracy - - 0.94 130
macro-avg 0.92 0.90 0.91 130
weighted-avg 0.94 0.94 0.94 130
```
---
layout: model
title: Detect Anatomical Structures (Single Entity - embeddings_clinical)
author: John Snow Labs
name: ner_anatomy_coarse
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
An NER model to extract all types of anatomical references in text using "embeddings_clinical" embeddings. It is a single entity model and generalizes all anatomical references to a single entity.
## Predicted Entities
`Anatomy`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ANATOMY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_en_3.0.0_3.0_1617209678971.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_en_3.0.0_3.0_1617209678971.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_anatomy_coarse", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["content in the lung tissue"]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_anatomy_coarse", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))
val data = Seq("""content in the lung tissue""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.anatomy.coarse").predict("""content in the lung tissue""")
```
## Results
```bash
| | ner_chunk | entity |
|---:|------------------:|----------:|
| 0 | lung tissue | Anatomy |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_anatomy_coarse|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Data Source
Trained on a custom dataset using 'embeddings_clinical'.
## Benchmarking
```bash
| | label | tp | fp | fn | prec | rec | f1 |
|---:|:--------------|------:|------:|------:|---------:|---------:|---------:|
| 0 | B-Anatomy | 2568 | 165 | 158 | 0.939627 | 0.94204 | 0.940832 |
| 1 | I-Anatomy | 1692 | 89 | 169 | 0.950028 | 0.909189 | 0.92916 |
| 2 | Macro-average | 4260 | 254 | 327 | 0.944827 | 0.925614 | 0.935122 |
| 3 | Micro-average | 4260 | 254 | 327 | 0.943731 | 0.928712 | 0.936161 |
```
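The averaged rows can be recomputed from the per-label counts above; a quick sanity check in plain Python (no Spark required) shows how the micro- and macro-averages in this table are derived:

```python
# Precision, recall, and F1 from raw tp/fp/fn counts, as in the table above.
def scores(tp, fp, fn):
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return prec, rec, 2 * prec * rec / (prec + rec)

b_prec, b_rec, b_f1 = scores(2568, 165, 158)   # B-Anatomy row
i_prec, i_rec, i_f1 = scores(1692, 89, 169)    # I-Anatomy row

# Micro-average: pool the counts across labels, then score once.
micro_prec, micro_rec, micro_f1 = scores(2568 + 1692, 165 + 89, 158 + 169)

# Macro-average: average the per-label precision/recall, then derive F1 from those.
macro_prec = (b_prec + i_prec) / 2
macro_rec = (b_rec + i_rec) / 2
macro_f1 = 2 * macro_prec * macro_rec / (macro_prec + macro_rec)

print(round(micro_prec, 6), round(micro_rec, 6))  # 0.943731 0.928712
print(round(macro_prec, 6), round(macro_f1, 6))   # 0.944827 0.935122
```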
---
layout: model
title: BERT Sentence Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on QQP
author: John Snow Labs
name: sent_bert_wiki_books_qqp
date: 2021-08-31
tags: [en, sentence_embeddings, open_source, wikipedia_dataset, books_corpus_dataset, qqp_dataset]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.2.0
spark_version: 3.0
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/wiki_books/1 and fine-tuned on QQP. Some changes have been made to the original training and export scheme based on more recent learnings.
This model is intended to be used for a variety of English NLP tasks. The pre-training data contains more formal text and the model may not generalize to more colloquial text such as social media or messages.
This model is fine-tuned on QQP and is recommended for semantic similarity of question-pair tasks. The fine-tuning task uses the Quora Question Pairs (QQP) dataset to predict whether two questions are duplicates.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_qqp_en_3.2.0_3.0_1630412104798.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_qqp_en_3.2.0_3.0_1630412104798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books_qqp", "en") \
.setInputCols("sentence") \
.setOutputCol("bert_sentence")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings])
```
```scala
val document_assembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentence_detector = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books_qqp", "en")
.setInputCols("sentence")
.setOutputCol("bert_sentence")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings))
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
sent_embeddings_df = nlu.load('en.embed_sentence.bert.wiki_books_qqp').predict(text, output_level='sentence')
sent_embeddings_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_bert_wiki_books_qqp|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[bert_sentence]|
|Language:|en|
|Case sensitive:|false|
## Data Source
[1]: [Wikipedia dataset](https://dumps.wikimedia.org/)
[2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb)
[3]: [Quora Question Pairs (QQP) dataset](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs)
This model has been imported from: https://tfhub.dev/google/experts/bert/wiki_books/qqp/2
---
layout: model
title: Portuguese Legal Roberta Embeddings
author: John Snow Labs
name: roberta_base_portuguese_legal
date: 2023-02-16
tags: [pt, portuguese, embeddings, transformer, open_source, legal, tensorflow]
task: Embeddings
language: pt
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-portuguese-roberta-base` is a Portuguese model originally trained by `joelito`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_portuguese_legal_pt_4.2.4_3.0_1676563448655.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_portuguese_legal_pt_4.2.4_3.0_1676563448655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
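This card omits the usage snippet that the other embeddings cards on this page include. A minimal sketch following the same pipeline pattern; the column names and sample text are illustrative, not taken from this card:

```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_base_portuguese_legal", "pt") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings])
data = spark.createDataFrame([["Exemplo de texto jurídico em português."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```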
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_base_portuguese_legal|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|pt|
|Size:|415.9 MB|
|Case sensitive:|true|
## References
https://huggingface.co/joelito/legal-portuguese-roberta-base
---
layout: model
title: Fast Neural Machine Translation Model from Korean to English
author: John Snow Labs
name: opus_mt_ko_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, ko, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `ko`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ko_en_xx_2.7.0_2.4_1609168124610.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ko_en_xx_2.7.0_2.4_1609168124610.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_ko_en", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_ko_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.ko.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_ko_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Multilingual XLMRobertaForTokenClassification Cased model (from nbroad)
author: John Snow Labs
name: xlmroberta_ner_jplu_r_40_lang
date: 2022-08-13
tags: [xx, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: xx
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `jplu-xlm-r-ner-40-lang` is a Multilingual model originally trained by `nbroad`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jplu_r_40_lang_xx_4.1.0_3.0_1660422549645.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jplu_r_40_lang_xx_4.1.0_3.0_1660422549645.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jplu_r_40_lang","xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jplu_r_40_lang","xx")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_jplu_r_40_lang|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|xx|
|Size:|967.7 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/nbroad/jplu-xlm-r-ner-40-lang
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from aszidon)
author: John Snow Labs
name: distilbert_qa_custom4
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbertcustom4` is an English model originally trained by `aszidon`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom4_en_4.3.0_3.0_1672774680666.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom4_en_4.3.0_3.0_1672774680666.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom4","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom4","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_custom4|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/aszidon/distilbertcustom4
---
layout: model
title: English BertForQuestionAnswering model (from LenaSchmidt)
author: John Snow Labs
name: bert_qa_no_need_to_name_this
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `no_need_to_name_this` is an English model originally trained by `LenaSchmidt`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_no_need_to_name_this_en_4.0.0_3.0_1654188966070.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_no_need_to_name_this_en_4.0.0_3.0_1654188966070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_no_need_to_name_this","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_no_need_to_name_this","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_no_need_to_name_this|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|410.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/LenaSchmidt/no_need_to_name_this
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from navteca)
author: John Snow Labs
name: roberta_qa_navteca_base_squad2
date: 2022-12-02
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2` is an English model originally trained by `navteca`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_navteca_base_squad2_en_4.2.4_3.0_1669986777716.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_navteca_base_squad2_en_4.2.4_3.0_1669986777716.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_navteca_base_squad2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_navteca_base_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_navteca_base_squad2|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/navteca/roberta-base-squad2
- https://rajpurkar.github.io/SQuAD-explorer/
---
layout: model
title: Chinese BertForMaskedLM Mini Cased model (from hfl)
author: John Snow Labs
name: bert_embeddings_minirbt_h256
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `minirbt-h256` is a Chinese model originally trained by `hfl`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_minirbt_h256_zh_4.2.4_3.0_1670022628583.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_minirbt_h256_zh_4.2.4_3.0_1670022628583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_minirbt_h256","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_minirbt_h256","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_minirbt_h256|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|39.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/hfl/minirbt-h256
- https://github.com/iflytek/MiniRBT
- https://github.com/ymcui/LERT
- https://github.com/ymcui/PERT
- https://github.com/ymcui/MacBERT
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/iflytek/HFL-Anthology
---
layout: model
title: English asr_wav2vec2_xls_r_tf_left_right_shuru TFWav2Vec2ForCTC from hrdipto
author: John Snow Labs
name: pipeline_asr_wav2vec2_xls_r_tf_left_right_shuru
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_tf_left_right_shuru` is an English model originally trained by hrdipto.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_tf_left_right_shuru_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_tf_left_right_shuru_en_4.2.0_3.0_1664040053317.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_tf_left_right_shuru_en_4.2.0_3.0_1664040053317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_tf_left_right_shuru', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_tf_left_right_shuru", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xls_r_tf_left_right_shuru|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English BertForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_cased_finetuned_squad_r3f
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-finetuned-squad-r3f` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_cased_finetuned_squad_r3f_en_4.0.0_3.0_1657182921335.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_cased_finetuned_squad_r3f_en_4.0.0_3.0_1657182921335.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_cased_finetuned_squad_r3f","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_cased_finetuned_squad_r3f","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_cased_finetuned_squad_r3f|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-cased-finetuned-squad-r3f
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from kj141)
author: John Snow Labs
name: distilbert_qa_kj141_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `kj141`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_kj141_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771876070.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_kj141_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771876070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kj141_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kj141_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_kj141_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/kj141/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Legal Indemnification and contribution Clause Binary Classifier (md)
author: John Snow Labs
name: legclf_indemnification_and_contribution_md
date: 2022-11-25
tags: [en, legal, classification, document, agreement, contract, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `indemnification-and-contribution` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
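As a generic illustration of the first splitting technique above (plain Python, independent of Spark NLP and not part of this model), paragraph splitting by multiline breaks can be sketched as:

```python
import re

def split_paragraphs(text: str) -> list[str]:
    # Split on one or more blank lines (the "multiline" criterion)
    # and drop empty chunks.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

sample = (
    "INDEMNIFICATION.\nThe Company shall indemnify the Underwriters.\n\n"
    "WHEREAS, the parties wish to enter into this Agreement."
)
for paragraph in split_paragraphs(sample):
    print(paragraph[:30])  # each chunk is then classified independently
```

Each resulting paragraph stays well under the 512-token limit for typical contracts and can be fed to the classifier as a separate row.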
## Predicted Entities
`other`, `indemnification-and-contribution`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_and_contribution_md_en_1.0.0_3.0_1669376499482.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_and_contribution_md_en_1.0.0_3.0_1669376499482.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+----------------------------------+
|                            result|
+----------------------------------+
|[indemnification-and-contribution]|
|                           [other]|
|                           [other]|
|[indemnification-and-contribution]|
+----------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_indemnification_and_contribution_md|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents.
## Benchmarking
```bash
                 precision    recall  f1-score   support
indemnification       0.92      0.96      0.94        25
          other       0.97      0.95      0.96        39
       accuracy                           0.95        64
      macro avg       0.95      0.95      0.95        64
   weighted avg       0.95      0.95      0.95        64
```
---
layout: model
title: Korean ElectraForQuestionAnswering model (from sehandev)
author: John Snow Labs
name: electra_qa_long
date: 2022-06-22
tags: [ko, open_source, electra, question_answering]
task: Question Answering
language: ko
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `koelectra-long-qa` is a Korean model originally trained by `sehandev`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_long_ko_4.0.0_3.0_1655922278586.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_long_ko_4.0.0_3.0_1655922278586.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_long","ko") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_long","ko")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ko.answer_question.electra").predict("""내 이름은 무엇입니까?|||제 이름은 클라라이고 저는 버클리에 살고 있습니다.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_long|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ko|
|Size:|419.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/sehandev/koelectra-long-qa
---
layout: model
title: Legal Whereas Clause Binary Classifier (md)
author: John Snow Labs
name: legclf_whereas_md
date: 2022-11-25
tags: [en, legal, classification, document, agreement, contract, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `whereas` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `whereas`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_whereas_md_en_1.0.0_3.0_1669376529983.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_whereas_md_en_1.0.0_3.0_1669376529983.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+---------+
|   result|
+---------+
|[whereas]|
|  [other]|
|  [other]|
|[whereas]|
+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_whereas_md|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents.
## Benchmarking
```bash
             precision    recall  f1-score   support
       other      0.93      1.00      0.96        39
     whereas      1.00      0.92      0.96        38
    accuracy                          0.96        77
   macro avg      0.96      0.96      0.96        77
weighted avg      0.96      0.96      0.96        77
```
---
layout: model
title: Translate English to Semitic languages Pipeline
author: John Snow Labs
name: translate_en_sem
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, sem, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `sem`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_sem_xx_2.7.0_2.4_1609690194050.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_sem_xx_2.7.0_2.4_1609690194050.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_sem", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_sem", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.sem').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_sem|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Pipeline to Detect Clinical Entities (ner_jsl_greedy_biobert)
author: John Snow Labs
name: ner_jsl_greedy_biobert_pipeline
date: 2023-03-20
tags: [ner, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_jsl_greedy_biobert](https://nlp.johnsnowlabs.com/2021/08/13/ner_jsl_greedy_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_biobert_pipeline_en_4.3.0_3.2_1679310105776.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_biobert_pipeline_en_4.3.0_3.2_1679310105776.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_jsl_greedy_biobert_pipeline", "en", "clinical/models")
text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_jsl_greedy_biobert_pipeline", "en", "clinical/models")
val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.biobert_jsl_greedy.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm","en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, rxnorm_resolver])
data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
...
val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
val rxnorm_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_rxnorm","en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, rxnorm_resolver))
val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
```bash
+--------------------+-----+---+---------+-------+----------+-----------------------------------------------+--------------------+
| chunk|begin|end| entity| code|confidence| resolutions| codes|
+--------------------+-----+---+---------+-------+----------+-----------------------------------------------+--------------------+
| hypertension| 68| 79| PROBLEM| 386165| 0.1567|hypercal:::hypersed:::hypertears:::hyperstat...|386165:::217667::...|
|chronic renal ins...| 83|109| PROBLEM| 218689| 0.1036|nephro calci:::dialysis solutions:::creatini...|218689:::3310:::2...|
| COPD| 113|116| PROBLEM|1539999| 0.1644|broncomar dm:::acne medication:::carbon mono...|1539999:::214981:...|
| gastritis| 120|128| PROBLEM| 225965| 0.1983|gastroflux:::gastroflux oral product:::uceri...|225965:::1176661:...|
| TIA| 136|138| PROBLEM|1089812| 0.0625|thera tears:::thiotepa injection:::nature's ...|1089812:::1660003...|
|a non-ST elevatio...| 182|202| PROBLEM| 218767| 0.1007|non-aspirin pm:::aspirin-free:::non aspirin ...|218767:::215440::...|
|Guaiac positive s...| 208|229| PROBLEM|1294361| 0.0820|anusol rectal product:::anusol hc rectal pro...|1294361:::1166715...|
|cardiac catheteri...| 295|317| TEST| 385247| 0.1566|cardiacap:::cardiology pack:::cardizem:::car...|385247:::545063::...|
| PTCA| 324|327|TREATMENT| 8410| 0.0867|alteplase:::reteplase:::pancuronium:::tripe ...|8410:::76895:::78...|
| mid LAD lesion| 332|345| PROBLEM| 151672| 0.0549|dulcolax:::lazerformalyde:::linaclotide:::du...|151672:::217985::...|
+--------------------+-----+---+---------+-------+----------+-----------------------------------------------+--------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---------------|---------------------|
| Name: | sbiobertresolve_rxnorm |
| Type: | SentenceEntityResolverModel |
| Compatibility: | Spark NLP 2.6.4+ |
| License: | Licensed |
| Edition: | Official |
|Input labels: | [ner_chunk, chunk_embeddings] |
|Output labels: | [resolution] |
| Language: | en |
| Dependencies: | sbiobert_base_cased_mli |
{:.h2_title}
## Data Source
Trained on November 2020 RxNorm Clinical Drugs ontology graph with ``sbiobert_base_cased_mli`` embeddings.
https://www.nlm.nih.gov/pubs/techbull/nd20/brief/nd20_rx_norm_november_release.html
---
layout: model
title: English asr_wav2vec2_cetuc_sid_voxforge_mls_1 TFWav2Vec2ForCTC from joaoalvarenga
author: John Snow Labs
name: asr_wav2vec2_cetuc_sid_voxforge_mls_1
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_cetuc_sid_voxforge_mls_1` is an English model originally trained by joaoalvarenga.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_cetuc_sid_voxforge_mls_1_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_cetuc_sid_voxforge_mls_1_en_4.2.0_3.0_1664023215604.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_cetuc_sid_voxforge_mls_1_en_4.2.0_3.0_1664023215604.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_cetuc_sid_voxforge_mls_1", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_cetuc_sid_voxforge_mls_1", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_cetuc_sid_voxforge_mls_1|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Multilingual BertForQuestionAnswering Cased model (from roshnir)
author: John Snow Labs
name: bert_qa_mbert_finetuned_mlqa_ar_hi_dev
date: 2022-07-07
tags: [xx, open_source, bert, question_answering]
task: Question Answering
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mBert-finetuned-mlqa-dev-ar-hi` is a Multilingual model originally trained by `roshnir`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_ar_hi_dev_xx_4.0.0_3.0_1657189808354.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_ar_hi_dev_xx_4.0.0_3.0_1657189808354.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_ar_hi_dev","xx") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_ar_hi_dev","xx")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_mbert_finetuned_mlqa_ar_hi_dev|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|xx|
|Size:|626.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/roshnir/mBert-finetuned-mlqa-dev-ar-hi
---
layout: model
title: Spanish RobertaForQuestionAnswering Base Cased model (from nlp-en-es)
author: John Snow Labs
name: roberta_qa_nlp_en_es_base_bne_finetuned_s_c
date: 2022-12-02
tags: [es, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: es
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-finetuned-sqac` is a Spanish model originally trained by `nlp-en-es`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_nlp_en_es_base_bne_finetuned_s_c_es_4.2.4_3.0_1669985738926.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_nlp_en_es_base_bne_finetuned_s_c_es_4.2.4_3.0_1669985738926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_nlp_en_es_base_bne_finetuned_s_c","es")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_nlp_en_es_base_bne_finetuned_s_c","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_nlp_en_es_base_bne_finetuned_s_c|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|460.2 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/nlp-en-es/roberta-base-bne-finetuned-sqac
---
layout: model
title: Irish Lemmatizer
author: John Snow Labs
name: lemma
date: 2020-07-29 23:34:00 +0800
task: Lemmatization
language: ga
edition: Spark NLP 2.5.5
spark_version: 2.4
tags: [lemmatizer, ga]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_ga_2.5.5_2.4_1596054397576.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_ga_2.5.5_2.4_1596054397576.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
lemmatizer = LemmatizerModel.pretrained("lemma", "ga") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Seachas a bheith ina rí ar an tuaisceart, is dochtúir Sasanach é John Snow agus ceannaire i bhforbairt ainéistéise agus sláinteachas míochaine.")
```
```scala
...
val lemmatizer = LemmatizerModel.pretrained("lemma", "ga")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer))
val data = Seq("Seachas a bheith ina rí ar an tuaisceart, is dochtúir Sasanach é John Snow agus ceannaire i bhforbairt ainéistéise agus sláinteachas míochaine.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Seachas a bheith ina rí ar an tuaisceart, is dochtúir Sasanach é John Snow agus ceannaire i bhforbairt ainéistéise agus sláinteachas míochaine."""]
lemma_df = nlu.load('ga.lemma').predict(text, output_level='document')
lemma_df.lemma.values[0]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=6, result='Seachas', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=8, end=8, result='a', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=10, end=15, result='bheith', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=17, end=19, result='i', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=21, end=22, result='rí', metadata={'sentence': '0'}, embeddings=[]),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma|
|Type:|lemmatizer|
|Compatibility:|Spark NLP 2.5.5+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[lemma]|
|Language:|ga|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: English DistilBertForQuestionAnswering model (from hcy11)
author: John Snow Labs
name: distilbert_qa_hcy11_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `hcy11`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hcy11_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725400512.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hcy11_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725400512.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hcy11_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hcy11_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_hcy11").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_hcy11_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/hcy11/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC5CDR_Chem_Modified_BioBERT_512
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC5CDR-Chem-Modified-BioBERT-512` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities
`Chemical`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_BioBERT_512_en_4.0.0_3.0_1657109320849.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_BioBERT_512_en_4.0.0_3.0_1657109320849.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_BioBERT_512","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_BioBERT_512","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_BC5CDR_Chem_Modified_BioBERT_512|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|403.7 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ghadeermobasher/BC5CDR-Chem-Modified-BioBERT-512
---
layout: model
title: Legal Asset Purchase Agreement Document Classifier (Longformer)
author: John Snow Labs
name: legclf_asset_purchase_agreement
date: 2022-11-10
tags: [en, legal, classification, licensed, agreement]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_asset_purchase_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `asset-purchase-agreement` or not (Binary Classification).
Longformers are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. We have found that, for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document itself without extra material before it, 4096 tokens are enough to perform Document Classification.
If that is not the case for you, let us know and we can take another approach: splitting the document into 4096-token chunks, embedding each chunk, and training on the averaged embeddings, which means the whole document is taken into account. In theory, however, this should not be required.
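The chunk-and-average fallback described above can be sketched in plain Python. This is a minimal illustration only: the 4096-token window matches the Longformer limit, and `embed` is a stand-in for whatever embeddings stage the pipeline actually uses.

```python
def chunk(tokens, size=4096):
    """Split a token sequence into consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def average_embeddings(vectors):
    """Element-wise mean of equal-length chunk embedding vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

# A 10000-token document yields three chunks (4096 + 4096 + 1808); each chunk
# is embedded separately and the chunk embeddings are averaged, so the whole
# document contributes to the final classification.
tokens = ["tok"] * 10000                      # stand-in for real tokens
chunks = chunk(tokens)
embed = lambda c: [float(len(c)), 1.0]        # stand-in for the embeddings stage
doc_vector = average_embeddings([embed(c) for c in chunks])
```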
## Predicted Entities
`asset-purchase-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_asset_purchase_agreement_en_1.0.0_3.0_1668104092936.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_asset_purchase_agreement_en_1.0.0_3.0_1668104092936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
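The usage snippet is missing from this card. The sketch below follows the document-classification pipeline pattern used by the other Legal NLP classifiers on Models Hub; it is a sketch under stated assumptions, not a definitive recipe. In particular, the Longformer embeddings model name (`legal_longformer_base`) and the `nlp`/`legal` module aliases from the `johnsnowlabs` library are assumptions to be verified against your installation and Models Hub.

```python
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Assumed embeddings stage: a legal Longformer checkpoint (verify the name on Models Hub)
embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_asset_purchase_agreement", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[documentAssembler, tokenizer, embeddings, sentence_embeddings, docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```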
## Results
```bash
+--------------------------+
|result                    |
+--------------------------+
|[asset-purchase-agreement]|
|[other]                   |
|[other]                   |
|[asset-purchase-agreement]|
+--------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_asset_purchase_agreement|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.5 MB|
## References
Legal documents, scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
asset-purchase-agreement 0.96 0.96 0.96 27
other 0.99 0.99 0.99 85
accuracy - - 0.98 112
macro-avg 0.98 0.98 0.98 112
weighted-avg 0.98 0.98 0.98 112
```
---
layout: model
title: Finnish XLMRobertaForTokenClassification Base Cased model (from tner)
author: John Snow Labs
name: xlmroberta_ner_base_fin
date: 2022-08-13
tags: [fi, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: fi
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-fin` is a Finnish model originally trained by `tner`.
## Predicted Entities
`other`, `person`, `location`, `organization`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_fin_fi_4.1.0_3.0_1660426752654.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_fin_fi_4.1.0_3.0_1660426752654.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_fin","fi") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_fin","fi")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_fin|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|fi|
|Size:|773.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/tner/xlm-roberta-base-fin
- https://github.com/asahi417/tner
---
layout: model
title: Ukrainian T5ForConditionalGeneration Cased model (from ukr-models)
author: John Snow Labs
name: t5_uk_summarizer
date: 2023-01-31
tags: [uk, open_source, t5, tensorflow]
task: Text Generation
language: uk
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `uk-summarizer` is a Ukrainian model originally trained by `ukr-models`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_uk_summarizer_uk_4.3.0_3.0_1675157739525.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_uk_summarizer_uk_4.3.0_3.0_1675157739525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_uk_summarizer","uk") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_uk_summarizer","uk")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_uk_summarizer|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|uk|
|Size:|995.5 MB|
## References
- https://huggingface.co/ukr-models/uk-summarizer
---
layout: model
title: Translate English to Chinese Pipeline
author: John Snow Labs
name: translate_en_zh
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, zh, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `zh`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_zh_xx_2.7.0_2.4_1609686009785.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_zh_xx_2.7.0_2.4_1609686009785.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_zh", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_zh", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.zh').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_zh|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Appointment Clause Binary Classifier
author: John Snow Labs
name: legclf_appointment_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `appointment` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each legal clause model you add.
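The paragraph-splitting technique listed above can be sketched in plain Python. This is a minimal illustration only: a real pipeline would use the splitting annotators from the tutorial, and the whitespace token count here is just a rough proxy for the 512-token embedding limit.

```python
import re

def split_paragraphs(text):
    """Paragraph splitting by multiline: break the document on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def fits_embedding_limit(piece, max_tokens=512):
    """Rough whitespace-token check against the 512-token embedding limit."""
    return len(piece.split()) <= max_tokens

doc = ("1. APPOINTMENT.\nThe Company hereby appoints the Agent.\n\n"
       "2. GOVERNING LAW.\nThis Agreement shall be governed by Delaware law.")
pieces = [p for p in split_paragraphs(doc) if fits_embedding_limit(p)]
# Each piece can now be sent to the classifier separately.
```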
## Predicted Entities
`other`, `appointment`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_appointment_clause_en_1.0.0_3.2_1660122120413.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_appointment_clause_en_1.0.0_3.2_1660122120413.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
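The usage snippet is missing from this card. The sketch below follows the clause-classification pipeline pattern used by the other Legal NLP classifiers on Models Hub; it is a sketch under stated assumptions, not a definitive recipe. In particular, the sentence-embeddings stage (`tfhub_use`) and the `nlp`/`legal` module aliases from the `johnsnowlabs` library are assumptions to be verified against your installation and Models Hub.

```python
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed embeddings stage: Universal Sentence Encoder (verify the name on Models Hub)
embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_appointment_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```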
## Results
```bash
+-------------+
|result       |
+-------------+
|[appointment]|
|[other]      |
|[other]      |
|[appointment]|
+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_appointment_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
appointment 0.86 0.94 0.90 32
other 0.98 0.95 0.96 101
accuracy - - 0.95 133
macro-avg 0.92 0.94 0.93 133
weighted-avg 0.95 0.95 0.95 133
```
---
layout: model
title: Japanese Bert Embeddings (Base, Character Tokenization, Whole Word Masking)
author: John Snow Labs
name: bert_embeddings_bert_base_japanese_char_whole_word_masking
date: 2022-04-11
tags: [bert, embeddings, ja, open_source]
task: Embeddings
language: ja
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-japanese-char-whole-word-masking` is a Japanese model originally trained by `cl-tohoku`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_char_whole_word_masking_ja_3.4.2_3.0_1649674360241.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_char_whole_word_masking_ja_3.4.2_3.0_1649674360241.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_char_whole_word_masking","ja") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_char_whole_word_masking","ja")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("私はSpark NLPを愛しています").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ja.embed.bert_base_japanese_char_whole_word_masking").predict("""私はSpark NLPを愛しています""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_japanese_char_whole_word_masking|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ja|
|Size:|334.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/cl-tohoku/bert-base-japanese-char-whole-word-masking
- https://github.com/google-research/bert
- https://github.com/cl-tohoku/bert-japanese/tree/v1.0
- https://github.com/attardi/wikiextractor
- https://taku910.github.io/mecab/
- https://creativecommons.org/licenses/by-sa/3.0/
- https://www.tensorflow.org/tfrc/
---
layout: model
title: Fast Neural Machine Translation Model from English to Tonga (Zambezi)
author: John Snow Labs
name: opus_mt_en_toi
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, toi, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `toi`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_toi_xx_2.7.0_2.4_1609166942717.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_toi_xx_2.7.0_2.4_1609166942717.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_en_toi", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_toi", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.toi').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_toi|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Warranty Clause Binary Classifier (CUAD dataset)
author: John Snow Labs
name: legclf_cuad_warranty_clause
date: 2022-10-18
tags: [warranty, clause, en, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `warranty` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each legal clause model you add.
There are other models with a similar title; the difference is the dataset each was trained on. This one was trained on the CUAD dataset.
## Predicted Entities
`warranty`, `other`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_warranty_clause_en_1.0.0_3.0_1666097671097.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_warranty_clause_en_1.0.0_3.0_1666097671097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
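The usage snippet is missing from this card. The sketch below follows the clause-classification pipeline pattern used by the other Legal NLP classifiers on Models Hub; it is a sketch under stated assumptions, not a definitive recipe. In particular, the sentence-embeddings stage (`tfhub_use`) and the `nlp`/`legal` module aliases from the `johnsnowlabs` library are assumptions to be verified against your installation and Models Hub.

```python
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed embeddings stage: Universal Sentence Encoder (verify the name on Models Hub)
embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_cuad_warranty_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```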
## Results
```bash
+----------+
|result    |
+----------+
|[warranty]|
+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_cuad_warranty_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.6 MB|
## References
CUAD dataset
## Benchmarking
```bash
label precision recall f1-score support
other 0.96 0.96 0.96 27
warranty 0.93 0.93 0.93 14
accuracy - - 0.95 41
macro-avg 0.95 0.95 0.95 41
weighted-avg 0.95 0.95 0.95 41
```
---
layout: model
title: Pipeline to Detect PHI for Deidentification
author: John Snow Labs
name: ner_deid_augmented_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, deidentification, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_deid_augmented](https://nlp.johnsnowlabs.com/2021/03/31/ner_deid_augmented_en.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_augmented_pipeline_en_3.4.1_3.0_1647864550318.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_augmented_pipeline_en_3.4.1_3.0_1647864550318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_deid_augmented_pipeline", "en", "clinical/models")
pipeline.annotate("HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.")
```
```scala
val pipeline = new PretrainedPipeline("ner_deid_augmented_pipeline", "en", "clinical/models")
pipeline.annotate("HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.deid.ner_augmented.pipeline").predict("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.""")
```
## Results
```bash
+---------------+---------+
|chunk |ner_label|
+---------------+---------+
|Smith |NAME |
|VA Hospital |LOCATION |
|John Green |NAME |
|2347165768 |ID |
|Day Hospital |LOCATION |
|02/04/2003 |DATE |
|Smith |NAME |
|Day Hospital |LOCATION |
|Smith |NAME |
|Smith |NAME |
|7 Ardmore Tower|LOCATION |
|Hart |NAME |
|Smith |NAME |
|02/07/2003 |DATE |
+---------------+---------+
```
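The chunk/label table above can be reproduced from the pipeline's `annotate` output, which returns a plain dict of lists keyed by output column. A minimal sketch in plain Python; the key names `ner_chunk` and `ner_label` are assumptions about this pipeline's output columns, so inspect a real `annotate` result first:

```python
def chunks_to_rows(annotations):
    # Pair each detected chunk with its label, mirroring the
    # Results table. Key names are assumed, not guaranteed.
    chunks = annotations.get("ner_chunk", [])
    labels = annotations.get("ner_label", [])
    return list(zip(chunks, labels))

example = {
    "ner_chunk": ["Smith", "VA Hospital", "02/04/2003"],
    "ner_label": ["NAME", "LOCATION", "DATE"],
}
print(chunks_to_rows(example))
```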
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_augmented_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from shahrukhx01)
author: John Snow Labs
name: roberta_qa_base_squad2_boolq_baseline
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-boolq-baseline` is an English model originally trained by `shahrukhx01`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_boolq_baseline_en_4.3.0_3.0_1674219076102.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_boolq_baseline_en_4.3.0_3.0_1674219076102.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_boolq_baseline","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_boolq_baseline","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
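To read the predicted answer out of `result`, select the `answer.result` column in Spark, or post-process a LightPipeline `fullAnnotate` call. A minimal sketch of the latter in plain Python, assuming each row maps output columns to lists of annotations that carry a `result` string (verify against the `Annotation` class in your Spark NLP version):

```python
def extract_answers(rows, col="answer"):
    # Flatten fullAnnotate-style rows into a list of answer strings.
    # The dict shape here is an assumption for illustration.
    return [ann["result"] for row in rows for ann in row.get(col, [])]

rows = [{"answer": [{"result": "Clara"}]}]
print(extract_answers(rows))  # -> ['Clara']
```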
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_squad2_boolq_baseline|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/shahrukhx01/roberta-base-squad2-boolq-baseline
---
layout: model
title: Translate Seychellois Creole to English Pipeline
author: John Snow Labs
name: translate_crs_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, crs, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `crs`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_crs_en_xx_2.7.0_2.4_1609690310342.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_crs_en_xx_2.7.0_2.4_1609690310342.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_crs_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_crs_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.crs.translate_to.en').predict(text, output_level='sentence')
translate_df
```
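`annotate` also accepts a list of strings, so several sentences can be translated in one call. A sketch of collecting the translations, assuming the pipeline exposes a `translation` output column (shown here against a stand-in pipeline object so the helper is self-contained):

```python
def translate_batch(pipeline, sentences):
    # annotate() on a list returns one dict per input sentence.
    # The "translation" key is an assumption tied to this pipeline's
    # final stage; confirm with a single annotate() call first.
    results = pipeline.annotate(sentences)
    return [" ".join(r.get("translation", [])) for r in results]

class FakePipeline:
    # Stand-in used only to demonstrate the helper's shape.
    def annotate(self, sentences):
        return [{"translation": [s.upper()]} for s in sentences]

print(translate_batch(FakePipeline(), ["bonzour", "mersi"]))
```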
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_crs_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_cline_emanuals_tech
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cline-emanuals-techqa` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_cline_emanuals_tech_en_4.3.0_3.0_1674209326690.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_cline_emanuals_tech_en_4.3.0_3.0_1674209326690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_cline_emanuals_tech","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_cline_emanuals_tech","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_cline_emanuals_tech|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|466.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AnonymousSub/cline-emanuals-techqa
---
layout: model
title: Vietnamese DistilBERT Base Cased Embeddings
author: John Snow Labs
name: distilbert_base_cased
date: 2022-01-13
tags: [embeddings, distilbert, vietnamese, vi, open_source]
task: Embeddings
language: vi
edition: Spark NLP 3.3.4
spark_version: 3.0
supported: true
annotator: DistilBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This embeddings model was imported from `Hugging Face`. It is a custom version of `distilbert_base_multilingual_cased` that produces the same representations as the original model, preserving its accuracy.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_base_cased_vi_3.3.4_3.0_1642064850307.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_base_cased_vi_3.3.4_3.0_1642064850307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
distilbert = DistilBertEmbeddings.pretrained("distilbert_base_cased", "vi")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, distilbert])
text = """Tôi yêu Spark NLP"""
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
```
```scala
...
val embeddings = DistilBertEmbeddings.pretrained("distilbert_base_cased", "vi")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("Tôi yêu Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("vi.embed.distilbert.cased").predict("""Tôi yêu Spark NLP""")
```
## Results
```bash
+-----+--------------------+
|token| embeddings|
+-----+--------------------+
| Tôi|[-0.38760236, -0....|
| yêu|[-0.3357051, -0.5...|
|Spark|[-0.20642707, -0....|
| NLP|[-0.013280544, -0...|
+-----+--------------------+
```
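In Spark, the token-embedding table above can be built by zipping the `token.result` and `embeddings.embeddings` arrays. The same pairing in plain Python over fullAnnotate-style output (field layout is an assumption for illustration):

```python
def pair_tokens_with_vectors(tokens, vectors, preview=2):
    # Pair each token string with a truncated preview of its vector,
    # mirroring the table in the Results section.
    return [(t, v[:preview]) for t, v in zip(tokens, vectors)]

tokens = ["Tôi", "yêu", "Spark", "NLP"]
vectors = [[-0.38, -0.01], [-0.33, -0.57], [-0.20, -0.44], [-0.01, -0.09]]
print(pair_tokens_with_vectors(tokens, vectors))
```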
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_base_cased|
|Compatibility:|Spark NLP 3.3.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|vi|
|Size:|211.6 MB|
|Case sensitive:|false|
---
layout: model
title: French CamemBert Embeddings (from Sonny)
author: John Snow Labs
name: camembert_embeddings_Sonny_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `Sonny`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Sonny_generic_model_fr_3.4.4_3.0_1653986957057.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Sonny_generic_model_fr_3.4.4_3.0_1653986957057.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Sonny_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Sonny_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
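Once token vectors are extracted from the `embeddings` column, a common downstream check is cosine similarity between two of them. A self-contained sketch in plain Python (no Spark required):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot(u, v) / (|u| * |v|).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

print(round(cosine([1.0, 0.0], [1.0, 0.0]), 4))  # -> 1.0
```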
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_Sonny_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Sonny/dummy-model
---
layout: model
title: French CamemBert Embeddings (from elusive-magnolia)
author: John Snow Labs
name: camembert_embeddings_elusive_magnolia_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `elusive-magnolia`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_elusive_magnolia_generic_model_fr_3.4.4_3.0_1653988370340.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_elusive_magnolia_generic_model_fr_3.4.4_3.0_1653988370340.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_elusive_magnolia_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_elusive_magnolia_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_elusive_magnolia_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/elusive-magnolia/dummy-model
---
layout: model
title: Translate English to Tuvaluan Pipeline
author: John Snow Labs
name: translate_en_tvl
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, tvl, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `tvl`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_tvl_xx_2.7.0_2.4_1609686360411.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_tvl_xx_2.7.0_2.4_1609686360411.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_tvl", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_tvl", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.tvl').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_tvl|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Translate Basque to English Pipeline
author: John Snow Labs
name: translate_eu_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, eu, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `eu`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_eu_en_xx_2.7.0_2.4_1609686527199.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_eu_en_xx_2.7.0_2.4_1609686527199.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_eu_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_eu_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.eu.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_eu_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForQuestionAnswering (from AlirezaBaneshi)
author: John Snow Labs
name: roberta_qa_autotrain_test2_756523213
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-test2-756523213` is an English model originally trained by `AlirezaBaneshi`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_autotrain_test2_756523213_en_4.0.0_3.0_1655727630639.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_autotrain_test2_756523213_en_4.0.0_3.0_1655727630639.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_autotrain_test2_756523213","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_autotrain_test2_756523213","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.roberta.756523213.by_AlirezaBaneshi").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_autotrain_test2_756523213|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|415.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AlirezaBaneshi/autotrain-test2-756523213
---
layout: model
title: Fast Neural Machine Translation Model from English to French
author: John Snow Labs
name: opus_mt_en_fr
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, fr, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `fr`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_fr_xx_2.7.0_2.4_1609166836357.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_fr_xx_2.7.0_2.4_1609166836357.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_fr", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_fr", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.fr').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_fr|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Pipeline to Detect Living Species (w2v_cc_300d)
author: John Snow Labs
name: ner_living_species_pipeline
date: 2023-03-13
tags: [gl, ner, clinical, licensed]
task: Named Entity Recognition
language: gl
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_living_species](https://nlp.johnsnowlabs.com/2022/06/23/ner_living_species_gl_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_gl_4.3.0_3.2_1678704830024.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_gl_4.3.0_3.2_1678704830024.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_living_species_pipeline", "gl", "clinical/models")
text = '''Muller de 45 anos, sen antecedentes médicos de interese, que foi remitida á consulta de dermatoloxía de urxencias por lesións faciales de tres semanas de evolución. A paciente non presentaba lesións noutras localizaciones nin outra clínica de interese. No seu centro de saúde prescribíronlle corticoides tópicos ante a sospeita de picaduras de artrópodos e unha semana despois, antivirales orais baixo o diagnóstico de posible infección herpética. As lesións interferían de forma notable na súa vida persoal e profesional xa que traballaba de face ao púbico. Unha semana máis tarde o diagnóstico foi confirmado ao resultar o cultivo positivo a Staphylococcus aureus.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_living_species_pipeline", "gl", "clinical/models")
val text = "Muller de 45 anos, sen antecedentes médicos de interese, que foi remitida á consulta de dermatoloxía de urxencias por lesións faciales de tres semanas de evolución. A paciente non presentaba lesións noutras localizaciones nin outra clínica de interese. No seu centro de saúde prescribíronlle corticoides tópicos ante a sospeita de picaduras de artrópodos e unha semana despois, antivirales orais baixo o diagnóstico de posible infección herpética. As lesións interferían de forma notable na súa vida persoal e profesional xa que traballaba de face ao púbico. Unha semana máis tarde o diagnóstico foi confirmado ao resultar o cultivo positivo a Staphylococcus aureus."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:----------------------|--------:|------:|:------------|-------------:|
| 0 | Muller | 0 | 5 | HUMAN | 0.9998 |
| 1 | paciente | 167 | 174 | HUMAN | 0.9985 |
| 2 | artrópodos | 344 | 353 | SPECIES | 0.9647 |
| 3 | antivirales | 378 | 388 | SPECIES | 0.8854 |
| 4 | herpética | 437 | 445 | SPECIES | 0.9592 |
| 5 | púbico | 551 | 556 | HUMAN | 0.7293 |
| 6 | Staphylococcus aureus | 644 | 664 | SPECIES | 0.87005 |
```
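The `begin`/`end` columns above are 0-based character offsets into the input text, inclusive at both ends. As a sanity check, they can be recomputed with plain Python string search, with no Spark required; this sketch assumes the text below is exactly the pipeline input and that both sides count Unicode code points the same way.

```python
# Recompute chunk offsets from the sample text above. Note that find()
# returns the first occurrence, so chunks repeated in the text would need
# the annotator's actual positions instead.
text = (
    "Muller de 45 anos, sen antecedentes médicos de interese, que foi "
    "remitida á consulta de dermatoloxía de urxencias por lesións faciales "
    "de tres semanas de evolución. A paciente non presentaba lesións "
    "noutras localizaciones nin outra clínica de interese. No seu centro "
    "de saúde prescribíronlle corticoides tópicos ante a sospeita de "
    "picaduras de artrópodos e unha semana despois, antivirales orais "
    "baixo o diagnóstico de posible infección herpética. As lesións "
    "interferían de forma notable na súa vida persoal e profesional xa que "
    "traballaba de face ao púbico. Unha semana máis tarde o diagnóstico "
    "foi confirmado ao resultar o cultivo positivo a Staphylococcus aureus."
)
chunks = ["Muller", "paciente", "artrópodos", "antivirales",
          "herpética", "púbico", "Staphylococcus aureus"]
for chunk in chunks:
    begin = text.find(chunk)
    end = begin + len(chunk) - 1  # inclusive end, as reported in the table
    print(f"{chunk}: begin={begin}, end={end}")
```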
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_living_species_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|gl|
|Size:|794.9 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Hindi BertForMaskedLM Cased model (from neuralspace-reverie)
author: John Snow Labs
name: bert_embeddings_indic_transformers
date: 2022-12-02
tags: [hi, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: hi
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-hi-bert` is a Hindi model originally trained by `neuralspace-reverie`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_hi_4.2.4_3.0_1670022367639.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_hi_4.2.4_3.0_1670022367639.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","hi") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","hi")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_indic_transformers|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|hi|
|Size:|612.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/neuralspace-reverie/indic-transformers-hi-bert
- https://oscar-corpus.com/
---
layout: model
title: Ukrainian DistilBERT Embeddings (from Geotrend)
author: John Snow Labs
name: distilbert_embeddings_distilbert_base_uk_cased
date: 2022-04-12
tags: [distilbert, embeddings, uk, open_source]
task: Embeddings
language: uk
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-uk-cased` is a Ukrainian model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_uk_cased_uk_3.4.2_3.0_1649783949701.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_uk_cased_uk_3.4.2_3.0_1649783949701.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_uk_cased","uk") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Я люблю Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_uk_cased","uk")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Я люблю Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("uk.embed.distilbert_base_cased").predict("""Я люблю Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_embeddings_distilbert_base_uk_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|uk|
|Size:|195.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/distilbert-base-uk-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Lemmatizer (Luxembourgish, SpacyLookup)
author: John Snow Labs
name: lemma_spacylookup
date: 2022-03-03
tags: [open_source, lemmatizer, lb]
task: Lemmatization
language: lb
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Luxembourgish Lemmatizer is a scalable, production-ready version of the rule-based lemmatizer available in the [spaCy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_lb_3.4.1_3.0_1646316561258.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_lb_3.4.1_3.0_1646316561258.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","lb") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer])
example = spark.createDataFrame([["Dir sidd net besser wéi ech"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","lb")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer))
val data = Seq("Dir sidd net besser wéi ech").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("lb.lemma").predict("""Dir sidd net besser wéi ech""")
```
## Results
```bash
+--------------------------------------+
|result |
+--------------------------------------+
|[dir, sidd, net, besseren, wéien, ech]|
+--------------------------------------+
```
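Under the hood, a spaCy-lookup lemmatizer is essentially a word-to-lemma table, with out-of-vocabulary tokens falling back to their surface form. A minimal sketch (not the Spark NLP implementation), with table entries reconstructed from the example output above:

```python
# Toy lookup table; entries mirror the sample result above, everything
# else falls through unchanged.
lookup = {"Dir": "dir", "besser": "besseren", "wéi": "wéien"}

def lemmatize(tokens):
    # fall back to the surface form when the table has no entry
    return [lookup.get(t, t) for t in tokens]

print(lemmatize("Dir sidd net besser wéi ech".split()))
# → ['dir', 'sidd', 'net', 'besseren', 'wéien', 'ech']
```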
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma_spacylookup|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|lb|
|Size:|3.9 MB|
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18)
author: John Snow Labs
name: distilbert_qa_base_uncased_becasv2_3
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becasv2-3` is an English model originally trained by `Evelyn18`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_3_en_4.3.0_3.0_1672767723393.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_3_en_4.3.0_3.0_1672767723393.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_3","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
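Extractive QA heads like DistilBertForQuestionAnswering score each context token as a possible answer start and end, and the predicted answer is the highest-scoring valid span. A toy span-selection sketch with invented logits (the real model scores subword tokens, not words):

```python
# Hypothetical per-token start/end logits for the context
# "My name is Clara and I live in Berkeley ."
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 5.0, 0.0, 0.1, 0.0, 0.1, 1.0, 0.0]
end_logits   = [0.0, 0.1, 0.2, 4.5, 0.1, 0.0, 0.1, 0.0, 1.5, 0.2]

def best_span(start_logits, end_logits, max_len=8):
    # score every span with end >= start and length <= max_len,
    # return the (start, end) pair with the highest combined logit
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # → Clara
```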
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_becasv2_3|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Evelyn18/distilbert-base-uncased-becasv2-3
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from bdickson)
author: John Snow Labs
name: distilbert_qa_bdickson_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `bdickson`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_bdickson_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770181883.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_bdickson_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770181883.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bdickson_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bdickson_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_bdickson_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/bdickson/distilbert-base-uncased-finetuned-squad
---
layout: model
title: German Bert Embeddings(Cased)
author: John Snow Labs
name: bert_embeddings_bert_base_german_dbmdz_cased
date: 2022-04-11
tags: [bert, embeddings, de, open_source]
task: Embeddings
language: de
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-german-dbmdz-cased` is a German model originally trained by HuggingFace.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_german_dbmdz_cased_de_3.4.2_3.0_1649676089568.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_german_dbmdz_cased_de_3.4.2_3.0_1649676089568.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_german_dbmdz_cased","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_german_dbmdz_cased","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ich liebe Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.embed.bert_base_german_dbmdz_cased").predict("""Ich liebe Spark NLP""")
```
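The `embeddings` column holds one dense vector per token; a common downstream step is comparing those vectors with cosine similarity. A self-contained sketch with toy 3-dimensional vectors standing in for token embeddings (real BERT-base vectors are 768-dimensional):

```python
import math

def cosine(u, v):
    # cosine similarity: dot product over the product of the norms
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# toy vectors standing in for two token embeddings
a, b = [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]
print(round(cosine(a, b), 3))  # → 0.5
```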
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_german_dbmdz_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|de|
|Size:|412.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/bert-base-german-dbmdz-cased
---
layout: model
title: Translate Estonian to English Pipeline
author: John Snow Labs
name: translate_et_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, et, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. Using an accelerator such as a GPU is recommended.
- source languages: `et`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_et_en_xx_2.7.0_2.4_1609699041974.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_et_en_xx_2.7.0_2.4_1609699041974.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_et_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_et_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.et.translate_to.en').predict(text, output_level='sentence')
translate_df
```
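Since decoding cost grows quickly with sequence length, long documents are best split into sentences before translation; Spark NLP pipelines do this with a SentenceDetector stage, but a naive splitter sketch (invented toy text, illustration only) shows the idea:

```python
import re

def split_sentences(text):
    # naive split on sentence-final punctuation followed by whitespace;
    # a trained sentence detector handles abbreviations etc. much better
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

doc = "Tere tulemast! See on esimene lause. See on teine."
print(split_sentences(doc))
# → ['Tere tulemast!', 'See on esimene lause.', 'See on teine.']
```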
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_et_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav TFWav2Vec2ForCTC from vai6hav
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav` is an English model originally trained by vai6hav.
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav_en_4.2.0_3.0_1664113051189.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav_en_4.2.0_3.0_1664113051189.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav", lang = "en")
val annotations = pipeline.transform(audioDF)
```
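The Wav2Vec2ForCTC stage inside this pipeline emits one label per audio frame, and the transcript is recovered with CTC decoding, which collapses runs of identical labels and drops the blank symbol. A greedy-decoding sketch over hypothetical per-frame argmax labels:

```python
import itertools

BLANK = "_"  # stand-in for the CTC blank token

def ctc_greedy_decode(frame_labels):
    # 1) collapse runs of identical labels, 2) remove blanks
    collapsed = [label for label, _ in itertools.groupby(frame_labels)]
    return "".join(c for c in collapsed if c != BLANK)

frames = list("hh_eee_l_ll_oo")  # invented frame-level predictions
print(ctc_greedy_decode(frames))  # → hello
```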
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_8
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-512-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_8_en_4.0.0_3.0_1654191650975.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_8_en_4.0.0_3.0_1654191650975.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_8","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_8","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.span_bert.base_cased_512d_seed_8").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_8|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|387.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-512-finetuned-squad-seed-8
---
layout: model
title: Sentence Entity Resolver for Snomed Concepts, Body Structure Version (sbiobert_base_cased_mli)
author: John Snow Labs
name: sbiobertresolve_snomed_bodyStructure
date: 2021-06-15
tags: [snomed, en, clinical, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.1.0
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical (anatomical structures) entities to Snomed codes (body structure version) using sentence embeddings.
## Predicted Entities
Snomed Codes and their normalized definition with `sbiobert_base_cased_mli` embeddings.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_bodyStructure_en_3.1.0_3.0_1623774132614.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_bodyStructure_en_3.1.0_3.0_1623774132614.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The ```sbiobertresolve_snomed_bodyStructure``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and an NER model, in one of three ways:
- With ```ner_jsl``` as the NER model and ```Disease_Syndrome_Disorder, External_body_part_or_region``` set in ```.setWhiteList()```.
- With ```ner_anatomy_coarse``` as the NER model; no ```.setWhiteList()``` is needed.
- By merging the ```ner_jsl``` and ```ner_anatomy_coarse``` model chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
jsl_sbert_embedder = BertSentenceEmbeddings\
.pretrained('sbiobert_base_cased_mli','en','clinical/models')\
.setInputCols(["ner_chunk"])\
.setOutputCol("sbert_embeddings")
snomed_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_snomed_bodyStructure", "en", "clinical/models") \
.setInputCols(["ner_chunk", "sbert_embeddings"]) \
.setOutputCol("snomed_code")
snomed_pipelineModel = PipelineModel(
stages = [
documentAssembler,
jsl_sbert_embedder,
snomed_resolver])
snomed_lp = LightPipeline(snomed_pipelineModel)
result = snomed_lp.fullAnnotate("Amputation stump")
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("sbert_embeddings")
val snomed_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_snomed_bodyStructure", "en", "clinical/models")
.setInputCols(Array("ner_chunk", "sbert_embeddings"))
.setOutputCol("snomed_code")
val snomed_pipeline = new Pipeline().setStages(Array(document_assembler, sbert_embedder, snomed_resolver))
val snomed_lp = new LightPipeline(snomed_pipeline.fit(Seq("").toDF("text")))
val result = snomed_lp.fullAnnotate("Amputation stump")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.snomed_body_structure").predict("""Amputation stump""")
```
## Results
```bash
| | chunks | code | resolutions | all_codes | all_distances |
|---:|:-----------------|:---------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------|
| 0 | amputation stump | 38033009 | [Amputation stump, Amputation stump of upper limb, Amputation stump of left upper limb, Amputation stump of lower limb, Amputation stump of left lower limb, Amputation stump of right upper limb, Amputation stump of right lower limb, ...]| ['38033009', '771359009', '771364008', '771358001', '771367001', '771365009', '771368006', ...] | ['0.0000', '0.0773', '0.0858', '0.0863', '0.0905', '0.0911', '0.0972', ...] |
```
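The resolver returns candidate codes ranked by embedding distance, as in the `all_distances` column above. The selection step is essentially a nearest-neighbour lookup over code embeddings; a minimal sketch with toy 2-dimensional vectors and only two of the codes from the table (vectors invented for illustration):

```python
import math

# toy index: (SNOMED code -> embedding); real embeddings are 768-dimensional
index = {
    "38033009": [0.9, 0.1],   # Amputation stump
    "771359009": [0.7, 0.3],  # Amputation stump of upper limb
}

def resolve(query, index):
    # rank codes by Euclidean distance between query and code embedding
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return sorted(index, key=lambda code: dist(query, index[code]))

print(resolve([0.9, 0.1], index))  # exact match ranks first
```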
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_snomed_bodyStructure|
|Compatibility:|Healthcare NLP 3.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[snomed_code]|
|Language:|en|
|Case sensitive:|true|
## Data Source
https://www.snomed.org/
---
layout: model
title: English DistilBertForTokenClassification Cased model (from Lucifermorningstar011)
author: John Snow Labs
name: distilbert_token_classifier_autotrain_ner_778023879
date: 2023-03-03
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-ner-778023879` is an English model originally trained by `Lucifermorningstar011`.
## Predicted Entities
`9`, `0`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_ner_778023879_en_4.3.1_3.0_1677881870073.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_ner_778023879_en_4.3.1_3.0_1677881870073.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_ner_778023879","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_ner_778023879","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_autotrain_ner_778023879|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|244.1 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/Lucifermorningstar011/autotrain-ner-778023879
---
layout: model
title: Pipeline to Extract Negation and Uncertainty Entities from Spanish Medical Texts (BertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_negation_uncertainty_pipeline
date: 2023-03-20
tags: [es, clinical, licensed, token_classification, bert, ner, negation, uncertainty, linguistics]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [bert_token_classifier_negation_uncertainty](https://nlp.johnsnowlabs.com/2022/08/11/bert_token_classifier_negation_uncertainty_es_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_negation_uncertainty_pipeline_es_4.3.0_3.2_1679298806721.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_negation_uncertainty_pipeline_es_4.3.0_3.2_1679298806721.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_negation_uncertainty_pipeline", "es", "clinical/models")
text = '''Con diagnóstico probable de cirrosis hepática (no conocida previamente) y peritonitis espontanea primaria con tratamiento durante 8 dias con ceftriaxona en el primer ingreso (no se realizó paracentesis control por escasez de liquido). Lesión tumoral en hélix izquierdo de 0,5 cms. de diámetro susceptible de ca basocelular perlado.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_negation_uncertainty_pipeline", "es", "clinical/models")
val text = "Con diagnóstico probable de cirrosis hepática (no conocida previamente) y peritonitis espontanea primaria con tratamiento durante 8 dias con ceftriaxona en el primer ingreso (no se realizó paracentesis control por escasez de liquido). Lesión tumoral en hélix izquierdo de 0,5 cms. de diámetro susceptible de ca basocelular perlado."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:-------------------------------------------------------|--------:|------:|:------------|-------------:|
| 0 | probable | 16 | 23 | UNC | 0.999994 |
| 1 | de cirrosis hepática | 25 | 44 | USCO | 0.999988 |
| 2 | no | 47 | 48 | NEG | 0.999995 |
| 3 | conocida previamente | 50 | 69 | NSCO | 0.999992 |
| 4 | no | 175 | 176 | NEG | 0.999995 |
| 5 | se realizó paracentesis control por escasez de liquido | 178 | 231 | NSCO | 0.999995 |
| 6 | susceptible de | 293 | 306 | UNC | 0.999986 |
| 7 | ca basocelular perlado | 308 | 329 | USCO | 0.99999 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_negation_uncertainty_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|es|
|Size:|410.6 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverterInternalModel
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18)
author: John Snow Labs
name: distilbert_qa_base_uncased_becas_3
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becas-3` is an English model originally trained by `Evelyn18`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_3_en_4.3.0_3.0_1672767491512.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_3_en_4.3.0_3.0_1672767491512.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_3","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_becas_3|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Evelyn18/distilbert-base-uncased-becas-3
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from k3nneth)
author: John Snow Labs
name: xlmroberta_ner_k3nneth_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `k3nneth`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_k3nneth_base_finetuned_panx_de_4.1.0_3.0_1660434928587.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_k3nneth_base_finetuned_panx_de_4.1.0_3.0_1660434928587.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_k3nneth_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_k3nneth_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_k3nneth_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/k3nneth/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: Legal "The Closing" Clause Binary Classifier
author: John Snow Labs
name: legclf_the_closing_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `the-closing` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
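For illustration outside Spark NLP, the first technique above (paragraph splitting by multiline) can be sketched in plain Python; this is a minimal sketch, and the workshop notebook's actual implementation may differ:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on blank lines (multiline splitting)."""
    # Two or more consecutive newlines (possibly with whitespace in between)
    # mark a paragraph break.
    parts = re.split(r"\n\s*\n", text)
    # Drop empty fragments and normalize whitespace inside each paragraph.
    return [" ".join(p.split()) for p in parts if p.strip()]

doc = "ARTICLE I\nThe Closing shall take place remotely.\n\nARTICLE II\nAll other terms remain in force."
print(split_paragraphs(doc))
```

Each resulting paragraph can then be fed to the classifier as a separate row, keeping every input under the 512-token limit.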
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `the-closing`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_the_closing_clause_en_1.0.0_3.2_1660123095618.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_the_closing_clause_en_1.0.0_3.2_1660123095618.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
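The usage snippet was missing from this card. The following is a plausible sketch following the pattern of other `legclf_*` clause classifiers; the `sent_bert_base_cased` embeddings model and the `legal/models` repository path are assumptions, so verify them against the card's Input/Output Labels before use:

```python
# Hypothetical pipeline sketch; requires Spark NLP for Legal (licensed).
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# The classifier consumes sentence embeddings (see Input Labels below);
# sent_bert_base_cased is an assumed choice of embeddings model.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_the_closing_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```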
## Results
```bash
+-------------+
|       result|
+-------------+
|[the-closing]|
|      [other]|
|      [other]|
|[the-closing]|
+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_the_closing_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 1.00 1.00 1.00 105
the-closing 1.00 1.00 1.00 35
accuracy - - 1.00 140
macro-avg 1.00 1.00 1.00 140
weighted-avg 1.00 1.00 1.00 140
```
---
layout: model
title: Clinical Portuguese Bert Embeddings
author: John Snow Labs
name: biobert_embeddings_clinical
date: 2022-04-11
tags: [biobert, embeddings, pt, open_source]
task: Embeddings
language: pt
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BioBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `biobertpt-clin` is a Portuguese model originally trained by `pucpr`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_embeddings_clinical_pt_3.4.2_3.0_1649686994929.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_embeddings_clinical_pt_3.4.2_3.0_1649686994929.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("biobert_embeddings_clinical","pt") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Odeio o cancro"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("biobert_embeddings_clinical","pt")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Odeio o cancro").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("pt.embed.gs_clinical").predict("""Odeio o cancro""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|biobert_embeddings_clinical|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|pt|
|Size:|667.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/pucpr/biobertpt-clin
- https://aclanthology.org/2020.clinicalnlp-1.7/
- https://github.com/HAILab-PUCPR/BioBERTpt
---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_42
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-16-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_42_en_4.0.0_3.0_1654191455684.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_42_en_4.0.0_3.0_1654191455684.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_42","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_42","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.span_bert.base_cased_seed_42").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_42|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|380.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-16-finetuned-squad-seed-42
---
layout: model
title: English asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu TFWav2Vec2ForCTC from adelgalu
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu` is an English model originally trained by adelgalu.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu_en_4.2.0_3.0_1664098872120.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu_en_4.2.0_3.0_1664098872120.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|349.3 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English BertForQuestionAnswering model (from maroo93)
author: John Snow Labs
name: bert_qa_squad1.1
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad1.1` is an English model originally trained by `maroo93`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_squad1.1_en_4.0.0_3.0_1654192132045.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_squad1.1_en_4.0.0_3.0_1654192132045.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad1.1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_squad1.1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.by_maroo93").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_squad1.1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/maroo93/squad1.1
---
layout: model
title: Pretrained Pipeline for Few-NERD-General NER Model
author: John Snow Labs
name: nerdl_fewnerd_100d_pipeline
date: 2022-06-28
tags: [fewnerd, nerdl, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the Few-NERD model.
## Predicted Entities
`PERSON`, `ORGANIZATION`, `LOCATION`, `ART`, `BUILDING`, `PRODUCT`, `EVENT`, `OTHER`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_FEW_NERD/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_FewNERD.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_100d_pipeline_en_4.0.0_3.0_1656388980361.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_100d_pipeline_en_4.0.0_3.0_1656388980361.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
fewnerd_pipeline = PretrainedPipeline("nerdl_fewnerd_100d_pipeline", lang = "en")
fewnerd_pipeline.annotate("""The Double Down is a sandwich offered by Kentucky Fried Chicken restaurants. He did not see active service again until 1882, when he took part in the Anglo-Egyptian War, and was present at the battle of Tell El Kebir (September 1882), for which he was mentioned in dispatches, received the Egypt Medal with clasp and the 3rd class of the Order of Medjidie, and was appointed a Companion of the Order of the Bath (CB).""")
```
```scala
val pipeline = new PretrainedPipeline("nerdl_fewnerd_100d_pipeline", lang = "en")
val result = pipeline.fullAnnotate("The Double Down is a sandwich offered by Kentucky Fried Chicken restaurants. He did not see active service again until 1882, when he took part in the Anglo-Egyptian War, and was present at the battle of Tell El Kebir (September 1882), for which he was mentioned in dispatches, received the Egypt Medal with clasp and the 3rd class of the Order of Medjidie, and was appointed a Companion of the Order of the Bath (CB).")(0)
```
## Results
```bash
+-----------------------+------------+
|chunk |ner_label |
+-----------------------+------------+
|Kentucky Fried Chicken |ORGANIZATION|
|Anglo-Egyptian War |EVENT |
|battle of Tell El Kebir|EVENT |
|Egypt Medal |OTHER |
|Order of Medjidie |OTHER |
+-----------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|nerdl_fewnerd_100d_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|167.3 MB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- NerDLModel
- NerConverter
- Finisher
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from Ching)
author: John Snow Labs
name: roberta_qa_negation_detector
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `negation_detector` is an English model originally trained by `Ching`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_negation_detector_en_4.3.0_3.0_1674211601485.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_negation_detector_en_4.3.0_3.0_1674211601485.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_negation_detector","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_negation_detector","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_negation_detector|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Ching/negation_detector
---
layout: model
title: Extract test entities (Voice of the Patients)
author: John Snow Labs
name: ner_vop_test_wip
date: 2023-04-20
tags: [licensed, clinical, en, ner, vop, patient, test]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts mentions of medical tests from documents written in patients' own words.
Note: The 'wip' suffix indicates that model development is a work in progress; the model will be finalized and its performance improved in upcoming releases.
## Predicted Entities
`Measurements`, `TestResult`, `Test`, `VitalTest`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_wip_en_4.4.0_3.0_1682013044617.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_wip_en_4.4.0_3.0_1682013044617.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_vop_test_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["I went to the endocrinology department to get my thyroid levels checked. They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_vop_test_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("I went to the endocrinology department to get my thyroid levels checked. They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| chunk | ner_label |
|:---------------|:------------|
| thyroid levels | Test |
| blood test | Test |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_test_wip|
|Compatibility:|Healthcare NLP 4.4.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.9 MB|
|Dependencies:|embeddings_clinical|
## References
In-house annotated health-related text in colloquial language.
## Sample text from the training dataset
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Measurements 100 58 30 130 0.63 0.77 0.69
TestResult 452 124 182 634 0.78 0.71 0.75
Test 1194 98 207 1401 0.92 0.85 0.89
VitalTest 195 20 23 218 0.91 0.89 0.90
macro_avg 1941 300 442 2383 0.81 0.80 0.81
micro_avg 1941 300 442 2383 0.87 0.81 0.84
```
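The precision, recall, and F1 columns above follow directly from the tp/fp/fn counts; as a sanity check, they can be recomputed in plain Python (rounded to two decimals, matching the table):

```python
def prf(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# Per-label counts (tp, fp, fn) taken from the benchmark table
counts = {
    "Measurements": (100, 58, 30),
    "TestResult": (452, 124, 182),
    "Test": (1194, 98, 207),
    "VitalTest": (195, 20, 23),
}

for label, (tp, fp, fn) in counts.items():
    print(label, prf(tp, fp, fn))

# Micro average: pool the counts across labels, then compute the metrics once
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
print("micro_avg", prf(tp, fp, fn))  # → (0.87, 0.81, 0.84)
```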
---
layout: model
title: Sentiment Analysis of IMDB Reviews Pipeline (analyze_sentimentdl_use_imdb)
author: John Snow Labs
name: analyze_sentimentdl_use_imdb
date: 2021-01-15
task: [Embeddings, Sentiment Analysis, Pipeline Public]
language: en
nav_key: models
edition: Spark NLP 2.7.1
spark_version: 2.4
tags: [en, pipeline, sentiment]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A pre-trained pipeline to classify IMDB reviews into `neg` and `pos` classes using `tfhub_use` embeddings.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/analyze_sentimentdl_use_imdb_en_2.7.1_2.4_1610723836151.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/analyze_sentimentdl_use_imdb_en_2.7.1_2.4_1610723836151.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("analyze_sentimentdl_use_imdb", lang = "en")
result = pipeline.fullAnnotate("Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("analyze_sentimentdl_use_imdb", lang = "en")
val result = pipeline.fullAnnotate("Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!")
```
{:.nlu-block}
```python
import nlu
text = ["""Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!"""]
sentiment_df = nlu.load('en.sentiment.imdb.use').predict(text, output_level='sentence')
sentiment_df
```
## Results
```bash
|    | document | sentiment |
|---:|:---------|:----------|
|  0 | Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now! | positive |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|analyze_sentimentdl_use_imdb|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.1+|
|Edition:|Official|
|Language:|en|
## Included Models
`tfhub_use`, `sentimentdl_use_imdb`
---
layout: model
title: Legal Deposit Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_deposit_agreement_bert
date: 2022-12-06
tags: [en, legal, classification, agreement, deposit, licensed, bert, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_deposit_agreement_bert` model is a Bert Sentence Embeddings Document Classifier that classifies whether a document belongs to the `deposit-agreement` class or not (binary classification).
Compared to the Longformer-based alternative, this model is lighter and has faster inference.
## Predicted Entities
`deposit-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_deposit_agreement_bert_en_1.0.0_3.0_1670349380582.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_deposit_agreement_bert_en_1.0.0_3.0_1670349380582.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
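No code snippet is included on this card, but given the metadata below (input `sentence_embeddings`, output `class`), a typical Legal NLP document-classification pipeline follows the pattern sketched here. This is a sketch only: the `sent_bert_base_cased` embeddings name is an assumption, so check the companion embeddings listed for this model before use.

```python
# Sketch, not the verified example for this model.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Assumed companion Bert sentence embeddings; confirm the exact model name.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_deposit_agreement_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR DOCUMENT TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```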
## Results
```bash
+-------------------+
|             result|
+-------------------+
|[deposit-agreement]|
|            [other]|
|            [other]|
|[deposit-agreement]|
+-------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_deposit_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents.
## Benchmarking
```bash
label precision recall f1-score support
deposit-agreement 0.97 0.97 0.97 36
other 0.98 0.98 0.98 65
accuracy - - 0.98 101
macro-avg 0.98 0.98 0.98 101
weighted-avg 0.98 0.98 0.98 101
```
---
layout: model
title: Legal Section headings Clause Binary Classifier
author: John Snow Labs
name: legclf_section_headings_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `section-headings` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only individual sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Keep in mind that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
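As a minimal, Spark-free illustration of the first splitting technique listed above (paragraph splitting by multiline), a sketch in plain Python:

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Split a document into paragraphs on blank lines (one or more empty lines)."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

# Hypothetical contract fragment for illustration
contract = """1. DEFINITIONS

"Agreement" means this deposit agreement.

2. TERM

This Agreement starts on the Effective Date."""

paragraphs = split_paragraphs(contract)
# Each paragraph (heading or clause body) can then be classified separately.
```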
## Predicted Entities
`other`, `section-headings`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_section_headings_clause_en_1.0.0_3.2_1660122983672.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_section_headings_clause_en_1.0.0_3.2_1660122983672.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+------------------+
|            result|
+------------------+
|[section-headings]|
|           [other]|
|           [other]|
|[section-headings]|
+------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_section_headings_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents scraped from the Internet and classified in-house.
## Benchmarking
```bash
label precision recall f1-score support
other 1.00 1.00 1.00 159
section-headings 1.00 1.00 1.00 46
accuracy - - 1.00 205
macro-avg 1.00 1.00 1.00 205
weighted-avg 1.00 1.00 1.00 205
```
---
layout: model
title: Extract Anatomical Entities from Voice of the Patient Documents (embeddings_clinical_large)
author: John Snow Labs
name: ner_vop_anatomy_emb_clinical_large
date: 2023-06-06
tags: [licensed, clinical, en, ner, vop, anatomy]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts anatomical terms from documents written in the patient’s own words.
## Predicted Entities
`BodyPart`, `Laterality`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_anatomy_emb_clinical_large_en_4.4.3_3.0_1686074062221.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_anatomy_emb_clinical_large_en_4.4.3_3.0_1686074062221.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_vop_anatomy_emb_clinical_large", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["Ugh, I pulled a muscle in my neck from sleeping weird last night. It's like a knot in my trapezius and it hurts to turn my head."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_vop_anatomy_emb_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("Ugh, I pulled a muscle in my neck from sleeping weird last night. It's like a knot in my trapezius and it hurts to turn my head.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| chunk | ner_label |
|:----------|:------------|
| muscle | BodyPart |
| neck | BodyPart |
| trapezius | BodyPart |
| head | BodyPart |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_anatomy_emb_clinical_large|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.8 MB|
|Dependencies:|embeddings_clinical_large|
## References
In-house annotated health-related text in colloquial language.
## Sample text from the training dataset
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
## Benchmarking
```bash
label tp fp fn total precision recall f1
BodyPart 2725 236 175 2900 0.92 0.94 0.93
Laterality 546 62 82 628 0.90 0.87 0.88
macro_avg 3271 298 257 3528 0.91 0.90 0.90
micro_avg 3271 298 257 3528 0.92 0.93 0.92
```
---
layout: model
title: Part of Speech for Bulgarian
author: John Snow Labs
name: pos_btb
date: 2021-03-23
tags: [pos, bg, open_source]
supported: true
task: Part of Speech Tagging
language: bg
edition: Spark NLP 2.7.5
spark_version: 2.4
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron` architecture.
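In an averaged perceptron, every weight keeps a running total over training steps, and the final model uses the average of each weight rather than its last value, which smooths out oscillations from late updates. A minimal sketch of the idea (illustrative only, not Spark NLP's actual implementation):

```python
from collections import defaultdict

class AveragedPerceptron:
    """Minimal averaged perceptron: final weights are averaged over all steps."""

    def __init__(self, labels):
        self.labels = labels
        self.weights = defaultdict(float)   # (feature, label) -> current weight
        self.totals = defaultdict(float)    # accumulated weight * steps active
        self.stamps = defaultdict(int)      # step at which a weight last changed
        self.step = 0

    def predict(self, features):
        scores = {label: sum(self.weights[(f, label)] for f in features)
                  for label in self.labels}
        # Break ties deterministically by label name
        return max(self.labels, key=lambda l: (scores[l], l))

    def _update_one(self, key, delta):
        # Credit the old weight for the steps it was active, then change it
        self.totals[key] += (self.step - self.stamps[key]) * self.weights[key]
        self.stamps[key] = self.step
        self.weights[key] += delta

    def update(self, truth, guess, features):
        self.step += 1
        if truth == guess:
            return
        for f in features:
            self._update_one((f, truth), +1.0)
            self._update_one((f, guess), -1.0)

    def average(self):
        # Replace each weight with its average over all training steps
        for key, w in list(self.weights.items()):
            total = self.totals[key] + (self.step - self.stamps[key]) * w
            self.weights[key] = total / self.step if self.step else 0.0

# Toy usage: tag tokens by suffix features (hypothetical feature names)
tagger = AveragedPerceptron(labels=["NOUN", "VERB"])
data = [({"suffix=ing"}, "VERB"), ({"suffix=tion"}, "NOUN")] * 5
for feats, gold in data:
    guess = tagger.predict(feats)
    tagger.update(gold, guess, feats)
tagger.average()
```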
## Predicted Entities
- ADJ
- ADP
- ADV
- AUX
- CCONJ
- DET
- NOUN
- NUM
- PART
- PRON
- PROPN
- PUNCT
- VERB
- X
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_btb_bg_2.7.5_2.4_1616506894131.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_btb_bg_2.7.5_2.4_1616506894131.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_btb", "bg")\
.setInputCols(["document", "token"])\
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['Столица на Република България е град София .']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_btb", "bg")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector,tokenizer, pos))
val data = Seq("Столица на Република България е град София .").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["Столица на Република България е град София ."]
token_df = nlu.load('bg.pos.btb').predict(text)
token_df
```
## Results
```bash
+--------------------------------------------+-------------------------------------------------+
|text |result |
+--------------------------------------------+-------------------------------------------------+
|Столица на Република България е град София .|[NOUN, ADP, NOUN, PROPN, AUX, NOUN, PROPN, PUNCT]|
+--------------------------------------------+-------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_btb|
|Compatibility:|Spark NLP 2.7.5+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[pos]|
|Language:|bg|
## Data Source
The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) data set.
## Benchmarking
```bash
| | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| ADJ | 0.89 | 0.87 | 0.88 | 1377 |
| ADP | 0.95 | 0.95 | 0.95 | 2238 |
| ADV | 0.94 | 0.92 | 0.93 | 671 |
| AUX | 0.98 | 0.97 | 0.97 | 916 |
| CCONJ | 0.96 | 0.95 | 0.96 | 467 |
| DET | 0.91 | 0.88 | 0.90 | 273 |
| INTJ | 1.00 | 1.00 | 1.00 | 1 |
| NOUN | 0.92 | 0.93 | 0.93 | 3486 |
| NUM | 0.89 | 0.87 | 0.88 | 223 |
| PART | 0.98 | 0.96 | 0.97 | 210 |
| PRON | 0.97 | 0.97 | 0.97 | 981 |
| PROPN | 0.88 | 0.89 | 0.89 | 805 |
| PUNCT | 0.95 | 0.96 | 0.95 | 2268 |
| SCONJ | 0.98 | 0.97 | 0.98 | 156 |
| VERB | 0.95 | 0.94 | 0.94 | 1652 |
| accuracy | | | 0.94 | 15724 |
| macro avg | 0.94 | 0.94 | 0.94 | 15724 |
| weighted avg | 0.94 | 0.94 | 0.94 | 15724 |
```
---
layout: model
title: English asr_wav2vec2_xlsr_53_phon TFWav2Vec2ForCTC from facebook
author: John Snow Labs
name: asr_wav2vec2_xlsr_53_phon
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_53_phon` is an English model originally trained by facebook.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_xlsr_53_phon_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_53_phon_en_4.2.0_3.0_1664109116417.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_53_phon_en_4.2.0_3.0_1664109116417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_xlsr_53_phon", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_xlsr_53_phon", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_xlsr_53_phon|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|756.9 MB|
---
layout: model
title: German XlmRoBertaForQuestionAnswering (from saattrupdan)
author: John Snow Labs
name: xlm_roberta_qa_xlmr_base_texas_squad_de_de_saattrupdan
date: 2022-06-24
tags: [de, open_source, question_answering, xlmroberta]
task: Question Answering
language: de
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlmr-base-texas-squad-de` is a German model originally trained by `saattrupdan`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_base_texas_squad_de_de_saattrupdan_de_4.0.0_3.0_1656062956033.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_base_texas_squad_de_de_saattrupdan_de_4.0.0_3.0_1656062956033.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlmr_base_texas_squad_de_de_saattrupdan","de") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlmr_base_texas_squad_de_de_saattrupdan","de")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.answer_question.squad_de_tuned.xlmr_roberta.base.by_saattrupdan").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlmr_base_texas_squad_de_de_saattrupdan|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|de|
|Size:|874.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/saattrupdan/xlmr-base-texas-squad-de
---
layout: model
title: Word2Vec Embeddings in Sardinian (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, sc, open_source]
task: Embeddings
language: sc
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sc_3.4.1_3.0_1647455351489.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sc_3.4.1_3.0_1647455351489.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sc") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sc")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("sc.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|sc|
|Size:|74.3 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Legal Permitted Use Clause Binary Classifier
author: John Snow Labs
name: legclf_permitted_use_clause
date: 2023-02-13
tags: [en, legal, classification, permitted, use, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `permitted_use` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only individual sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Keep in mind that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`permitted_use`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_permitted_use_clause_en_1.0.0_3.0_1676305311848.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_permitted_use_clause_en_1.0.0_3.0_1676305311848.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+---------------+
|         result|
+---------------+
|[permitted_use]|
|        [other]|
|        [other]|
|[permitted_use]|
+---------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_permitted_use_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.1 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 1.00 1.00 1.00 5
permitted_use 1.00 1.00 1.00 11
accuracy - - 1.00 16
macro-avg 1.00 1.00 1.00 16
weighted-avg 1.00 1.00 1.00 16
```
---
layout: model
title: Legal Bonus Clause Binary Classifier
author: John Snow Labs
name: legclf_bonus_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `bonus` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clauses Classifiers available in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `bonus`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_bonus_clause_en_1.0.0_3.2_1660122172503.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_bonus_clause_en_1.0.0_3.2_1660122172503.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------+
| result|
+-------+
|[bonus]|
|[other]|
|[other]|
|[bonus]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_bonus_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
bonus 0.97 0.95 0.96 38
other 0.98 0.99 0.98 95
accuracy - - 0.98 133
macro-avg 0.98 0.97 0.97 133
weighted-avg 0.98 0.98 0.98 133
```
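The averaged rows in the table above can be recomputed from the per-class scores. A quick check in plain Python (the numbers come straight from the table; small discrepancies are possible because the per-class rows are already rounded to two decimals):

```python
# Recomputing the averaged rows from the per-class f1 scores above.
# macro-avg: unweighted mean over classes; weighted-avg: mean weighted
# by class support.
f1 = {"bonus": 0.96, "other": 0.98}
support = {"bonus": 38, "other": 95}

macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(round(macro_f1, 2))  # 0.97, matching the macro-avg row
# weighted_f1 comes out ~0.974 here; the table's 0.98 was computed from
# the unrounded per-class scores before display.
print(round(weighted_f1, 2))
```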
---
layout: model
title: Detect Diagnoses and Procedures (Spanish)
author: John Snow Labs
name: ner_diag_proc
date: 2021-03-31
tags: [ner, clinical, licensed, es]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The Named Entity Recognition annotator allows a generic model to be trained with a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art NER model: Chiu & Nichols, *Named Entity Recognition with Bidirectional LSTM-CNNs*. This is a pretrained named entity recognition deep learning model for diagnoses and procedures in Spanish.
## Predicted Entities
`DIAGNOSTICO`, `PROCEDIMIENTO`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DIAG_PROC_ES/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DIAG_PROC_ES.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diag_proc_es_3.0.0_3.0_1617208422892.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diag_proc_es_3.0.0_3.0_1617208422892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embed = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d","es","clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("word_embeddings")
model = MedicalNerModel.pretrained("ner_diag_proc","es","clinical/models")\
.setInputCols("sentence","token","word_embeddings")\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embed, model, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([['HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.']], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embed = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d","es","clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("word_embeddings")
val model = MedicalNerModel.pretrained("ner_diag_proc","es","clinical/models")
.setInputCols(Array("sentence","token","word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embed, model, ner_converter))
val data = Seq("""HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.med_ner").predict("""HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.""")
```
## Results
```bash
+----------------------+-------------+
|chunk |ner_label |
+----------------------+-------------+
|ENFERMEDAD |DIAGNOSTICO |
|cáncer de vejiga |DIAGNOSTICO |
|resección |PROCEDIMIENTO|
|cistectomía |PROCEDIMIENTO|
|estrés |DIAGNOSTICO |
|infarto subendocárdico|DIAGNOSTICO |
|enfermedad |DIAGNOSTICO |
|arterias coronarias |DIAGNOSTICO |
+----------------------+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_diag_proc|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|es|
## Benchmarking
```bash
+-------------+------+------+------+------+---------+------+------+
| entity| tp| fp| fn| total|precision|recall| f1|
+-------------+------+------+------+------+---------+------+------+
|PROCEDIMIENTO|2299.0|1103.0| 860.0|3159.0| 0.6758|0.7278|0.7008|
| DIAGNOSTICO|6623.0|1364.0|2974.0|9597.0| 0.8292|0.6901|0.7533|
+-------------+------+------+------+------+---------+------+------+
+------------------+
| macro|
+------------------+
|0.7270531284138397|
+------------------+
+------------------+
| micro|
+------------------+
|0.7402992400932049|
+------------------+
```
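The per-entity rows and the macro score in the benchmark above follow directly from the tp/fp/fn counts. A quick recomputation in plain Python:

```python
# precision = tp/(tp+fp), recall = tp/(tp+fn), f1 = 2*tp/(2*tp+fp+fn);
# the macro score is the unweighted mean of the per-entity f1 values.
counts = {
    "PROCEDIMIENTO": (2299, 1103, 860),
    "DIAGNOSTICO": (6623, 1364, 2974),
}

def scores(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

f1s = []
for entity, (tp, fp, fn) in counts.items():
    p, r, f = scores(tp, fp, fn)
    print(entity, round(p, 4), round(r, 4), round(f, 4))
    f1s.append(f)

macro = sum(f1s) / len(f1s)
print(macro)  # ~0.72705, matching the macro row above
```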
---
layout: model
title: English BertForQuestionAnswering model (from motiondew)
author: John Snow Labs
name: bert_qa_bert_finetuned_lr2_e5_b16_ep2
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-lr2-e5-b16-ep2` is an English model originally trained by `motiondew`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_lr2_e5_b16_ep2_en_4.0.0_3.0_1654535195058.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_lr2_e5_b16_ep2_en_4.0.0_3.0_1654535195058.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_finetuned_lr2_e5_b16_ep2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_finetuned_lr2_e5_b16_ep2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bert.by_motiondew").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_finetuned_lr2_e5_b16_ep2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/motiondew/bert-finetuned-lr2-e5-b16-ep2
---
layout: model
title: Spanish RobertaForQuestionAnswering (from mrm8488)
author: John Snow Labs
name: roberta_qa_mrm8488_roberta_base_bne_finetuned_sqac
date: 2022-06-20
tags: [es, open_source, question_answering, roberta]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-finetuned-sqac` is a Spanish model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_mrm8488_roberta_base_bne_finetuned_sqac_es_4.0.0_3.0_1655729996691.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_mrm8488_roberta_base_bne_finetuned_sqac_es_4.0.0_3.0_1655729996691.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_mrm8488_roberta_base_bne_finetuned_sqac","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_mrm8488_roberta_base_bne_finetuned_sqac","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.sqac.roberta.base.by_mrm8488").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_mrm8488_roberta_base_bne_finetuned_sqac|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|460.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/mrm8488/roberta-base-bne-finetuned-sqac
- https://paperswithcode.com/sota?task=Question+Answering&dataset=sqac
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from clementgyj)
author: John Snow Labs
name: roberta_qa_finetuned_squad_50k
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-finetuned-squad-50k` is an English model originally trained by `clementgyj`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_squad_50k_en_4.3.0_3.0_1674220438911.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_squad_50k_en_4.3.0_3.0_1674220438911.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_squad_50k","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_squad_50k","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_finetuned_squad_50k|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|462.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/clementgyj/roberta-finetuned-squad-50k
---
layout: model
title: English AlbertForQuestionAnswering model (from twmkn9)
author: John Snow Labs
name: albert_base_qa_squad2
date: 2022-06-15
tags: [open_source, albert, question_answering, en]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: AlBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert-base-v2-squad2` is an English model originally trained by `twmkn9`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_base_qa_squad2_en_4.0.0_3.0_1655294222450.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_base_qa_squad2_en_4.0.0_3.0_1655294222450.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = AlbertForQuestionAnswering.pretrained("albert_base_qa_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_base_qa_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.span_question.albert").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_base_qa_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|42.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
https://huggingface.co/twmkn9/albert-base-v2-squad2
---
layout: model
title: Clinical Deidentification
author: John Snow Labs
name: clinical_deidentification
date: 2023-06-13
tags: [deidentification, pipeline, de, licensed, clinical]
task: Pipeline Healthcare
language: de
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline can be used to deidentify PHI information from **German** medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `STREET`, `USERNAME`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE`, `CONTACT`, `ID`, `ZIP`, `ACCOUNT`, `SSN`, `DLN`, `PLATE` entities.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_de_4.4.4_3.2_1686663693325.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_de_4.4.4_3.2_1686663693325.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("clinical_deidentification", "de", "clinical/models")
sample = """Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus eingeliefert.
Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.
Persönliche Daten :
ID-Nummer: T0110053F
Platte A-BC124
Kontonummer: DE89370400440532013000
SSN : 13110587M565
Lizenznummer: B072RRE2I55
Adresse : St.Johann-Straße 13 19300
"""
result = deid_pipeline.annotate(sample)
print("\n".join(result['masked']))
print("\n".join(result['masked_with_chars']))
print("\n".join(result['masked_fixed_length_chars']))
print("\n".join(result['obfuscated']))
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val deid_pipeline = PretrainedPipeline("clinical_deidentification","de","clinical/models")
val sample = """Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus eingeliefert.
Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.
Persönliche Daten :
ID-Nummer: T0110053F
Platte A-BC124
Kontonummer: DE89370400440532013000
SSN : 13110587M565
Lizenznummer: B072RRE2I55
Adresse : St.Johann-Straße 13 19300"""
val result = deid_pipeline.annotate(sample)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.deid.clinical").predict("""Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus eingeliefert.
Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.
Persönliche Daten :
ID-Nummer: T0110053F
Platte A-BC124
Kontonummer: DE89370400440532013000
SSN : 13110587M565
Lizenznummer: B072RRE2I55
Adresse : St.Johann-Straße 13 19300
""")
```
## Results
```bash
Results
Masked with entity labels
------------------------------
Zusammenfassung : <PATIENT> wird am Morgen des <DATE> ins <HOSPITAL> eingeliefert.
Herr <PATIENT> ist <AGE> Jahre alt und hat zu viel Wasser in den Beinen.
Persönliche Daten :
ID-Nummer: <ID>
Platte <PLATE>
Kontonummer: <ACCOUNT>
SSN : <SSN>
Lizenznummer: <DLN>
Adresse : <STREET> <ZIP>
Masked with chars
------------------------------
Zusammenfassung : [************] wird am Morgen des [**************] ins [**********************] eingeliefert.
Herr [************] ist ** Jahre alt und hat zu viel Wasser in den Beinen.
Persönliche Daten :
ID-Nummer: [*******]
Platte [*****]
Kontonummer: [********************]
SSN : [**********]
Lizenznummer: [*********]
Adresse : [*****************] [***]
Masked with fixed length chars
------------------------------
Zusammenfassung : **** wird am Morgen des **** ins **** eingeliefert.
Herr **** ist **** Jahre alt und hat zu viel Wasser in den Beinen.
Persönliche Daten :
ID-Nummer: ****
Platte ****
Kontonummer: ****
SSN : ****
Lizenznummer: ****
Adresse : **** ****
Obfuscated
------------------------------
Zusammenfassung : Herrmann Kallert wird am Morgen des 11-26-1977 ins International Neuroscience eingeliefert.
Herr Herrmann Kallert ist 79 Jahre alt und hat zu viel Wasser in den Beinen.
Persönliche Daten :
ID-Nummer: 136704D357
Platte QA348G
Kontonummer: 192837465738
SSN : 1310011981M454
Lizenznummer: XX123456
Adresse : Klingelhöferring 31206
```
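The four output modes shown above can be illustrated with plain string operations. This is only a sketch of what the pipeline's `DeIdentificationModel` stages produce internally; the helper functions below are hypothetical, not part of the library, and the entity span is given explicitly rather than found by NER:

```python
# Illustrative only: mimics three of the pipeline's masking modes with
# plain string replacement. The real DeIdentificationModel locates
# entities via NER before masking.
def mask_with_label(text, entity, label):
    # Entity-label mask, e.g. "Michael Berger" -> "<PATIENT>".
    return text.replace(entity, f"<{label}>")

def mask_with_chars(text, entity):
    # Same-length mask: brackets plus asterisks, preserving entity length.
    return text.replace(entity, "[" + "*" * (len(entity) - 2) + "]")

def mask_fixed_length(text, entity):
    # Fixed-length mask: every entity becomes exactly four asterisks.
    return text.replace(entity, "****")

sentence = "Herr Michael Berger ist 76 Jahre alt."
print(mask_with_label(sentence, "Michael Berger", "PATIENT"))
print(mask_with_chars(sentence, "Michael Berger"))
print(mask_fixed_length(sentence, "Michael Berger"))
```

Obfuscation (the fourth mode) instead swaps each entity for a realistic fake value of the same type, as in the `Herrmann Kallert` output above.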
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|clinical_deidentification|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|de|
|Size:|1.3 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ChunkMergeModel
- ChunkMergeModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- Finisher
---
layout: model
title: German asr_exp_w2v2t_vp_100k_s627 TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: asr_exp_w2v2t_vp_100k_s627
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2t_vp_100k_s627` is a German model originally trained by jonatasgrosman.
NOTE: This model works only on a CPU. If you need to use this model on a GPU device, please use asr_exp_w2v2t_vp_100k_s627_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2t_vp_100k_s627_de_4.2.0_3.0_1664105815417.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2t_vp_100k_s627_de_4.2.0_3.0_1664105815417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_exp_w2v2t_vp_100k_s627", "de")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_exp_w2v2t_vp_100k_s627", "de")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_exp_w2v2t_vp_100k_s627|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|de|
|Size:|1.2 GB|
---
layout: model
title: English RobertaForQuestionAnswering (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739775556.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739775556.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.base_rule_based_twostagetriplet_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|306.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0
---
layout: model
title: Legal Rights And Freedoms Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_rights_and_freedoms_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, rights_and_freedoms, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the `legclf_rights_and_freedoms_bert` model, a BERT Sentence Embeddings document classifier, determines whether the document belongs to the `Rights_and_Freedoms` class or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`Rights_and_Freedoms`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_rights_and_freedoms_bert_en_1.0.0_3.0_1678111839271.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_rights_and_freedoms_bert_en_1.0.0_3.0_1678111839271.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------+
|result|
+-------+
|[Rights_and_Freedoms]|
|[Other]|
|[Other]|
|[Rights_and_Freedoms]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_rights_and_freedoms_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.2 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 0.89 0.85 0.87 39
Rights_and_Freedoms 0.79 0.85 0.81 26
accuracy - - 0.85 65
macro-avg 0.84 0.85 0.84 65
weighted-avg 0.85 0.85 0.85 65
```
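The averaged rows in the benchmark table above follow directly from the per-label scores; a quick sanity check, using the plain mean for the macro average and the support-weighted mean for the weighted average:

```python
# Sanity-check the averaged rows of the benchmark table above.
f1 = {"Other": 0.87, "Rights_and_Freedoms": 0.81}
support = {"Other": 39, "Rights_and_Freedoms": 26}

# Macro average: unweighted mean over labels.
macro_f1 = sum(f1.values()) / len(f1)
# Weighted average: mean weighted by each label's support.
weighted_f1 = sum(f1[l] * support[l] for l in f1) / sum(support.values())

print(round(macro_f1, 2))     # 0.84
print(round(weighted_f1, 2))  # 0.85
```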
---
layout: model
title: Portuguese asr_bp_lapsbm1_xlsr TFWav2Vec2ForCTC from lgris
author: John Snow Labs
name: asr_bp_lapsbm1_xlsr
date: 2022-09-26
tags: [wav2vec2, pt, audio, open_source, asr]
task: Automatic Speech Recognition
language: pt
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_bp_lapsbm1_xlsr` is a Portuguese model originally trained by lgris.
NOTE: This model works only on a CPU. If you need to use this model on a GPU device, please use asr_bp_lapsbm1_xlsr_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_bp_lapsbm1_xlsr_pt_4.2.0_3.0_1664190605281.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_bp_lapsbm1_xlsr_pt_4.2.0_3.0_1664190605281.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_bp_lapsbm1_xlsr", "pt")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_bp_lapsbm1_xlsr", "pt")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_bp_lapsbm1_xlsr|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|pt|
|Size:|756.4 MB|
---
layout: model
title: Fast Neural Machine Translation Model from English to Cushitic languages
author: John Snow Labs
name: opus_mt_en_cus
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, cus, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects (see below for an incomplete list).
- source languages: `en`
- target languages: `cus`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_cus_xx_2.7.0_2.4_1609168898891.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_cus_xx_2.7.0_2.4_1609168898891.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_cus", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_cus", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.cus').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_cus|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: BERT Sentence Embeddings (Large Uncased)
author: John Snow Labs
name: sent_bert_large_uncased
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model contains a deep bidirectional transformer trained on Wikipedia and the BookCorpus. The details are described in the paper "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)".
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_large_uncased_en_2.6.0_2.4_1598347026632.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_large_uncased_en_2.6.0_2.4_1598347026632.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_large_uncased", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_large_uncased", "en")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.bert_large_uncased').predict(text, output_level='sentence')
embeddings_df
```
{:.h2_title}
## Results
```bash
sentence en_embed_sentence_bert_large_uncased_embeddings
I hate cancer [[-0.13290119171142578, -0.2996622622013092, -...
Antibiotics aren't painkiller [[-0.13290119171142578, -0.2996622622013092, -...
```
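Downstream, sentence embeddings like the ones above are typically compared with cosine similarity. A minimal, stdlib-only sketch; the 4-dimensional vectors here are made up for illustration (the real model outputs 1024 dimensions):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional embeddings, for illustration only.
v1 = [-0.13, -0.30, 0.42, 0.11]
v2 = [-0.12, -0.28, 0.40, 0.15]
print(round(cosine_similarity(v1, v2), 3))
```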
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_bert_large_uncased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|en|
|Dimension:|1024|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from [https://tfhub.dev/google/bert_uncased_L-24_H-1024_A-16/1](https://tfhub.dev/google/bert_uncased_L-24_H-1024_A-16/1)
---
layout: model
title: Chinese Part of Speech Tagger (from ckiplab)
author: John Snow Labs
name: bert_pos_bert_base_chinese_pos
date: 2022-04-26
tags: [bert, pos, part_of_speech, zh, open_source]
task: Part of Speech Tagging
language: zh
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-chinese-pos` is a Chinese model originally trained by `ckiplab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_chinese_pos_zh_3.4.2_3.0_1650993041893.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_chinese_pos_zh_3.4.2_3.0_1650993041893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_chinese_pos","zh") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_chinese_pos","zh")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.pos.bert_base_chinese_pos").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_pos_bert_base_chinese_pos|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|zh|
|Size:|381.8 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ckiplab/bert-base-chinese-pos
- https://github.com/ckiplab/ckip-transformers
- https://muyang.pro
- https://ckip.iis.sinica.edu.tw
---
layout: model
title: Swedish Legal Roberta Embeddings
author: John Snow Labs
name: roberta_base_swedish_legal
date: 2023-02-17
tags: [se, swedish, embeddings, transformer, open_source, legal, tensorflow]
task: Embeddings
language: se
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-swedish-roberta-base` is a Swedish model originally trained by `joelito`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_swedish_legal_se_4.2.4_3.0_1676643288694.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_swedish_legal_se_4.2.4_3.0_1676643288694.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_base_swedish_legal|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|se|
|Size:|415.9 MB|
|Case sensitive:|true|
## References
https://huggingface.co/joelito/legal-swedish-roberta-base
---
layout: model
title: Financial Financial conditions Item Binary Classifier
author: John Snow Labs
name: finclf_financial_conditions_item
date: 2022-08-10
tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `financial_conditions` item type of 10-K Annual Reports. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.
If you have big financial documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
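Paragraph splitting by multiline, the first technique mentioned above, can be approximated with the standard library alone. This is a sketch using a made-up snippet of report text; the workshop notebook linked above shows the Spark NLP-native approach:

```python
import re

def split_paragraphs(text):
    # Split on one or more blank lines and drop empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical excerpt of a 10-K filing, for illustration only.
report = "Item 7. Financial Conditions.\nRevenue grew 4%.\n\nItem 8. Other.\nNo material changes."
print(split_paragraphs(report))
# ['Item 7. Financial Conditions.\nRevenue grew 4%.', 'Item 8. Other.\nNo material changes.']
```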
## Predicted Entities
`other`, `financial_conditions`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_financial_conditions_item_en_1.0.0_3.2_1660154420184.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_financial_conditions_item_en_1.0.0_3.2_1660154420184.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------+
| result|
+-------+
|[financial_conditions]|
|[other]|
|[other]|
|[financial_conditions]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finclf_financial_conditions_item|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.5 MB|
## References
Weak labelling on documents from Edgar database
## Benchmarking
```bash
label precision recall f1-score support
financial_conditions 0.83 0.73 0.78 245
other 0.75 0.84 0.80 237
accuracy - - 0.79 482
macro-avg 0.79 0.79 0.79 482
weighted-avg 0.79 0.79 0.79 482
```
---
layout: model
title: Legal Non Exclusivity Clause Binary Classifier
author: John Snow Labs
name: legclf_non_exclusivity_clause
date: 2023-01-29
tags: [en, legal, classification, exclusivity, clauses, non_exclusivity, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `non-exclusivity` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`non-exclusivity`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_non_exclusivity_clause_en_1.0.0_3.0_1675006033580.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_non_exclusivity_clause_en_1.0.0_3.0_1675006033580.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------+
|result|
+-------+
|[non-exclusivity]|
|[other]|
|[other]|
|[non-exclusivity]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_non_exclusivity_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
non-exclusivity 0.93 0.96 0.95 27
other 0.97 0.95 0.96 39
accuracy - - 0.95 66
macro-avg 0.95 0.96 0.95 66
weighted-avg 0.96 0.95 0.95 66
```
---
layout: model
title: Finnish asr_wav2vec2_large_xlsr_finnish TFWav2Vec2ForCTC from birgermoell
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_finnish
date: 2022-09-24
tags: [wav2vec2, fi, audio, open_source, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_finnish` is a Finnish model originally trained by birgermoell.
NOTE: This model works only on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_finnish_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_finnish_fi_4.2.0_3.0_1664021375004.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_finnish_fi_4.2.0_3.0_1664021375004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xlsr_finnish", "fi")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xlsr_finnish", "fi")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xlsr_finnish|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|fi|
|Size:|1.2 GB|
---
layout: model
title: Legal Intellectual property Clause Binary Classifier
author: John Snow Labs
name: legclf_intellectual_property_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `intellectual-property` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `intellectual-property`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_intellectual_property_clause_en_1.0.0_3.2_1660123623906.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_intellectual_property_clause_en_1.0.0_3.2_1660123623906.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------+
| result|
+-------+
|[intellectual-property]|
|[other]|
|[other]|
|[intellectual-property]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_intellectual_property_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
intellectual-property 0.95 0.85 0.90 47
other 0.93 0.98 0.95 95
accuracy - - 0.94 142
macro-avg 0.94 0.92 0.93 142
weighted-avg 0.94 0.94 0.94 142
```
---
layout: model
title: German Part of Speech Tagger (from KoichiYasuoka)
author: John Snow Labs
name: bert_pos_bert_large_german_upos
date: 2022-05-09
tags: [bert, pos, part_of_speech, de, open_source]
task: Part of Speech Tagging
language: de
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-german-upos` is a German model originally trained by `KoichiYasuoka`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_large_german_upos_de_3.4.2_3.0_1652092375858.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_large_german_upos_de_3.4.2_3.0_1652092375858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_large_german_upos","de") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_large_german_upos","de")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Ich liebe Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_pos_bert_large_german_upos|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|de|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/KoichiYasuoka/bert-large-german-upos
- https://github.com/UniversalDependencies/UD_German-HDT
- https://universaldependencies.org/u/pos/
- https://github.com/KoichiYasuoka/esupar
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_8
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-32-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_8_en_4.0.0_3.0_1657185089636.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_8_en_4.0.0_3.0_1657185089636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_8","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_8","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_8|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-32-finetuned-squad-seed-8
---
layout: model
title: Sentence Embeddings - sbert medium (tuned)
author: John Snow Labs
name: sbiobert_jsl_rxnorm_cased
date: 2021-12-23
tags: [licensed, embeddings, clinical, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.4
spark_version: 2.4
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps sentences & documents to a 768-dimensional dense vector space by using average pooling on top of the BioBERT model. It is also fine-tuned on an RxNorm dataset to improve generalization over medication-related datasets.
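The average pooling step can be illustrated with a toy, framework-free sketch; the 4-dimensional token vectors below are an assumption for brevity, whereas the real model produces 768-dimensional embeddings:

```python
def mean_pool(token_embeddings: list[list[float]]) -> list[float]:
    """Average token vectors position-wise into one fixed-size sentence vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[i] for vec in token_embeddings) / n for i in range(dim)]

# Three toy 4-dimensional token vectors (the real model uses 768 dimensions).
tokens = [[1.0, 0.0, 2.0, 4.0],
          [3.0, 0.0, 2.0, 0.0],
          [2.0, 3.0, 2.0, 2.0]]
sentence_vector = mean_pool(tokens)
```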
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobert_jsl_rxnorm_cased_en_3.3.4_2.4_1640271525048.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobert_jsl_rxnorm_cased_en_3.3.4_2.4_1640271525048.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_jsl_rxnorm_cased", "en", "clinical/models")\
.setInputCols(["sentence"])\
.setOutputCol("sbiobert_embeddings")
```
```scala
val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_jsl_rxnorm_cased", "en", "clinical/models")
.setInputCols("sentence")
.setOutputCol("sbiobert_embeddings")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed_sentence.biobert.rxnorm").predict("""Put your text here.""")
```
## Results
```bash
Gives a 768-dimensional vector representation of the sentence.
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobert_jsl_rxnorm_cased|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|402.0 MB|
---
layout: model
title: Pipeline to Detect Medication Entities, Assign Assertion Status and Find Relations
author: John Snow Labs
name: explain_clinical_doc_medication
date: 2023-04-20
tags: [licensed, clinical, ner, en, assertion, relation_extraction, posology, medication]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A pipeline for detecting posology entities with the `ner_posology_large` NER model, assigning their assertion status with `assertion_jsl` model, and extracting relations between posology-related terminology with `posology_re` relation extraction model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_medication_en_4.3.0_3.2_1682017727303.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_medication_en_4.3.0_3.2_1682017727303.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("explain_clinical_doc_medication", "en", "clinical/models")
text = '''The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2. She received a course of Bactrim for 14 days for UTI. She was prescribed 5000 units of Fragmin subcutaneously daily, and along with Lantus 40 units subcutaneously at bedtime.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("explain_clinical_doc_medication", "en", "clinical/models")
val text = "The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2. She received a course of Bactrim for 14 days for UTI. She was prescribed 5000 units of Fragmin subcutaneously daily, and along with Lantus 40 units subcutaneously at bedtime."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.explain_dco.clinical_medication.pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2. She received a course of Bactrim for 14 days for UTI. She was prescribed 5000 units of Fragmin subcutaneously daily, and along with Lantus 40 units subcutaneously at bedtime.""")
```
## Results
```bash
+----+----------------+------------+
| | chunks | entities |
|---:|:---------------|:-----------|
| 0 | insulin | DRUG |
| 1 | Bactrim | DRUG |
| 2 | for 14 days | DURATION |
| 3 | 5000 units | DOSAGE |
| 4 | Fragmin | DRUG |
| 5 | subcutaneously | ROUTE |
| 6 | daily | FREQUENCY |
| 7 | Lantus | DRUG |
| 8 | 40 units | DOSAGE |
| 9 | subcutaneously | ROUTE |
| 10 | at bedtime | FREQUENCY |
+----+----------------+------------+
+----+----------+------------+-------------+
| | chunks | entities | assertion |
|---:|:---------|:-----------|:------------|
| 0 | insulin | DRUG | Present |
| 1 | Bactrim | DRUG | Past |
| 2 | Fragmin | DRUG | Planned |
| 3 | Lantus | DRUG | Planned |
+----+----------+------------+-------------+
+----------------+-----------+------------+-----------+----------------+
| relation | entity1 | chunk1 | entity2 | chunk2 |
|:---------------|:----------|:-----------|:----------|:---------------|
| DRUG-DURATION | DRUG | Bactrim | DURATION | for 14 days |
| DOSAGE-DRUG | DOSAGE | 5000 units | DRUG | Fragmin |
| DRUG-ROUTE | DRUG | Fragmin | ROUTE | subcutaneously |
| DRUG-FREQUENCY | DRUG | Fragmin | FREQUENCY | daily |
| DRUG-DOSAGE | DRUG | Lantus | DOSAGE | 40 units |
| DRUG-ROUTE | DRUG | Lantus | ROUTE | subcutaneously |
| DRUG-FREQUENCY | DRUG | Lantus | FREQUENCY | at bedtime |
+----------------+-----------+------------+-----------+----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_clinical_doc_medication|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
- NerConverterInternalModel
- AssertionDLModel
- PerceptronModel
- DependencyParserModel
- PosologyREModel
---
layout: model
title: Bemba (Zambia) asr_wav2vec2_large_xls_r_300m_bemba_fds TFWav2Vec2ForCTC from csikasote
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_bemba_fds
date: 2022-09-24
tags: [wav2vec2, bem, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: bem
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_bemba_fds` is a Bemba (Zambia) model originally trained by csikasote.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_bemba_fds_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_bemba_fds_bem_4.2.0_3.0_1664023955232.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_bemba_fds_bem_4.2.0_3.0_1664023955232.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_bemba_fds', lang = 'bem')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_bemba_fds", lang = "bem")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_bemba_fds|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|bem|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English image_classifier_vit_ViTFineTuned ViTForImageClassification from pthpth
author: John Snow Labs
name: image_classifier_vit_ViTFineTuned
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_ViTFineTuned` is an English model originally trained by pthpth.
## Predicted Entities
`white_bread`, `brown_bread`, `cracker`, `aluminium_foil`, `linen`, `wool`, `corduroy`, `wood`, `lettuce_leaf`, `cotton`, `cork`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ViTFineTuned_en_4.1.0_3.0_1660167943982.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ViTFineTuned_en_4.1.0_3.0_1660167943982.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_ViTFineTuned", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_ViTFineTuned", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_ViTFineTuned|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Legal Effectiveness Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_effectiveness_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, effectiveness, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Effectiveness` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into account that the embeddings used by this model allow a maximum of 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
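Combining several binary clause classifiers can be sketched as follows; the two keyword-based classifier functions are hypothetical stand-ins for the real Legal NLP models:

```python
# Hypothetical stand-ins for binary clause classifiers: each returns the
# positive clause label, or "other"/"Other", for a given provision text.
def classify_effectiveness(text: str) -> str:
    return "Effectiveness" if "effective as of" in text.lower() else "Other"

def classify_ip(text: str) -> str:
    return "intellectual-property" if "intellectual property" in text.lower() else "other"

def combine(text: str, classifiers: dict) -> dict:
    """Run every binary classifier and collect a clause -> True/False map."""
    return {name: fn(text).lower() != "other" for name, fn in classifiers.items()}

provision = "This Agreement shall become effective as of the Closing Date."
flags = combine(provision, {
    "Effectiveness": classify_effectiveness,
    "intellectual-property": classify_ip,
})
```

Each added classifier simply contributes one more True/False entry to the map.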
## Predicted Entities
`Effectiveness`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_effectiveness_bert_en_1.0.0_3.0_1678050012884.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_effectiveness_bert_en_1.0.0_3.0_1678050012884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+---------------+
|result         |
+---------------+
|[Effectiveness]|
|[Other]        |
|[Other]        |
|[Effectiveness]|
+---------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_effectiveness_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Effectiveness 0.88 0.92 0.90 24
Other 0.94 0.92 0.93 36
accuracy - - 0.92 60
macro-avg 0.91 0.92 0.91 60
weighted-avg 0.92 0.92 0.92 60
```
---
layout: model
title: German asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545 TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545
date: 2022-09-26
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545` is a German model originally trained by jonatasgrosman.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545_de_4.2.0_3.0_1664191216048.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545_de_4.2.0_3.0_1664191216048.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545", "de")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545", "de")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|de|
|Size:|1.2 GB|
---
layout: model
title: Part of Speech for Afrikaans
author: John Snow Labs
name: pos_afribooms
date: 2021-03-16
tags: [af, open_source, pos]
supported: true
task: Part of Speech Tagging
language: af
edition: Spark NLP 2.7.5
spark_version: 2.4
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron` architecture.
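The averaged-perceptron idea can be illustrated with a toy sketch using word-identity features only; this is an assumption-laden simplification for exposition, not Spark NLP's actual `PerceptronModel` implementation:

```python
from collections import defaultdict

class TinyAveragedPerceptron:
    """Toy averaged-perceptron tagger: one weight per (word, tag) pair."""
    def __init__(self, tags):
        self.tags = list(tags)
        self.weights = defaultdict(float)   # (word, tag) -> current weight
        self.totals = defaultdict(float)    # accumulated weights for averaging
        self.steps = 0

    def predict(self, word, weights=None):
        w = self.weights if weights is None else weights
        return max(self.tags, key=lambda t: (w[(word, t)], t))

    def train(self, examples, epochs=3):
        for _ in range(epochs):
            for word, gold in examples:
                guess = self.predict(word)
                if guess != gold:  # standard perceptron update on mistakes
                    self.weights[(word, gold)] += 1.0
                    self.weights[(word, guess)] -= 1.0
                self.steps += 1
                for key, value in self.weights.items():
                    self.totals[key] += value
        # Averaged weights smooth out the effect of late updates.
        return {k: v / self.steps for k, v in self.totals.items()}

tagger = TinyAveragedPerceptron(["DET", "NOUN", "VERB"])
avg = tagger.train([("die", "DET"), ("kodes", "NOUN"), ("word", "VERB")])
pred = tagger.predict("kodes", weights=defaultdict(float, avg))
```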
## Predicted Entities
- ADJ
- ADP
- ADV
- AUX
- CCONJ
- DET
- NOUN
- NUM
- PART
- PRON
- PROPN
- PUNCT
- VERB
- X
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_afribooms_af_2.7.5_2.4_1615903333785.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_afribooms_af_2.7.5_2.4_1615903333785.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_afribooms", "af")\
.setInputCols(["document", "token"])\
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['Die kodes wat gebruik word , moet duidelik en verstaanbaar vir leerders en ouers wees .']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_afribooms", "af")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer ,pos))
val data = Seq("Die kodes wat gebruik word , moet duidelik en verstaanbaar vir leerders en ouers wees .").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["Die kodes wat gebruik word , moet duidelik en verstaanbaar vir leerders en ouers wees ."]
token_df = nlu.load('af.pos.afribooms').predict(text)
token_df
```
## Results
```bash
+---------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
|text |result |
+---------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
|Die kodes wat gebruik word , moet duidelik en verstaanbaar vir leerders en ouers wees .|[DET, NOUN, PRON, VERB, AUX, PUNCT, AUX, ADJ, CCONJ, ADJ, ADP, NOUN, CCONJ, NOUN, AUX, PUNCT]|
+---------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_afribooms|
|Compatibility:|Spark NLP 2.7.5+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[pos]|
|Language:|af|
## Data Source
The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) data set.
## Benchmarking
```bash
| | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| ADJ | 0.60 | 0.67 | 0.63 | 665 |
| ADP | 0.76 | 0.78 | 0.77 | 1299 |
| ADV | 0.74 | 0.69 | 0.72 | 523 |
| AUX | 0.85 | 0.83 | 0.84 | 663 |
| CCONJ | 0.71 | 0.71 | 0.71 | 380 |
| DET | 0.83 | 0.70 | 0.76 | 1014 |
| NOUN | 0.69 | 0.72 | 0.71 | 2025 |
| NUM | 0.76 | 0.76 | 0.76 | 42 |
| PART | 0.67 | 0.68 | 0.68 | 322 |
| PRON | 0.87 | 0.87 | 0.87 | 794 |
| PROPN | 0.82 | 0.73 | 0.77 | 156 |
| PUNCT | 0.68 | 0.70 | 0.69 | 877 |
| SCONJ | 0.85 | 0.85 | 0.85 | 210 |
| SYM | 0.87 | 0.88 | 0.87 | 142 |
| VERB | 0.69 | 0.72 | 0.70 | 889 |
| X | 0.35 | 0.14 | 0.20 | 64 |
| accuracy | | | 0.74 | 10065 |
| macro avg | 0.73 | 0.72 | 0.72 | 10065 |
| weighted avg | 0.74 | 0.74 | 0.74 | 10065 |
```
---
layout: model
title: English DistilBertForQuestionAnswering model (from vitusya)
author: John Snow Labs
name: distilbert_qa_vitusya_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `vitusya`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_vitusya_base_uncased_finetuned_squad_en_4.0.0_3.0_1654726511045.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_vitusya_base_uncased_finetuned_squad_en_4.0.0_3.0_1654726511045.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vitusya_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vitusya_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_vitusya").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_vitusya_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/vitusya/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Legal Investment Subadvisory Agreement Document Classifier (Longformer)
author: John Snow Labs
name: legclf_investment_subadvisory_agreement
date: 2022-11-10
tags: [en, legal, classification, agreement, investment_subadvisory, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_investment_subadvisory_agreement` model is a Legal Longformer Document Classifier to classify whether a document belongs to the class `investment-subadvisory-agreement` or not (Binary Classification).
Longformers are limited to 4096 tokens, so only the first 4096 tokens are taken into account. We have found that for the large majority of documents in legal corpora, provided they are clean and contain only the legal document without any extra leading material, 4096 tokens are enough to perform Document Classification.
If not, let us know and we can take another approach for you: splitting the document into 4096-token chunks, embedding each chunk, and training on the averaged embeddings, which means the whole document is taken into account. In theory, however, this should not be required.
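The chunk-and-average fallback can be sketched as follows; `embed_chunk` is a hypothetical stand-in for the real Longformer embedder, returning a toy 2-dimensional vector instead of a real embedding:

```python
def chunk(tokens: list[str], size: int = 4096) -> list[list[str]]:
    """Split a token sequence into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def average_vectors(vectors: list[list[float]]) -> list[float]:
    """Position-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def embed_chunk(tokens: list[str]) -> list[float]:
    # Hypothetical stand-in for the real embedder: a toy vector built from
    # chunk statistics (token count, total character count).
    return [float(len(tokens)), float(sum(len(t) for t in tokens))]

tokens = [f"tok{i}" for i in range(8)]
chunks = chunk(tokens, size=4)
doc_vector = average_vectors([embed_chunk(c) for c in chunks])
```

Training then proceeds on `doc_vector`, so every chunk of the document contributes to the representation.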
## Predicted Entities
`investment-subadvisory-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_investment_subadvisory_agreement_en_1.0.0_3.0_1668115303552.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_investment_subadvisory_agreement_en_1.0.0_3.0_1668115303552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+----------------------------------+
|result                            |
+----------------------------------+
|[investment-subadvisory-agreement]|
|[other]                           |
|[other]                           |
|[investment-subadvisory-agreement]|
+----------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_investment_subadvisory_agreement|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.0 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents.
## Benchmarking
```bash
label precision recall f1-score support
investment-subadvisory-agreement 1.00 0.98 0.99 42
other 0.99 1.00 0.99 66
accuracy - - 0.99 108
macro-avg 0.99 0.99 0.99 108
weighted-avg 0.99 0.99 0.99 108
```
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Walterchamy)
author: John Snow Labs
name: distilbert_qa_walterchamy_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Walterchamy`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_walterchamy_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769525196.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_walterchamy_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769525196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_walterchamy_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_walterchamy_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_walterchamy_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Walterchamy/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Medically Sound Suggestion Classifier (BioBERT)
author: John Snow Labs
name: bert_sequence_classifier_vop_sound_medical
date: 2023-06-13
tags: [licensed, clinical, classification, en, vop, tensorflow]
task: Text Classification
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier meant to identify whether the suggestion mentioned in the text is medically sound.
## Predicted Entities
`True`, `False`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_sound_medical_en_4.4.3_3.0_1686673701807.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_sound_medical_en_4.4.3_3.0_1686673701807.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vop_sound_medical", "en", "clinical/models")\
.setInputCols(["document",'token'])\
.setOutputCol("prediction")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
data = spark.createDataFrame(["I had a lung surgery for emphyema and after surgery my xray showing some recovery.",
"I was advised to put honey on a burned skin."], StringType()).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vop_sound_medical", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("prediction")
val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier))
val data = Seq("I had a lung surgery for emphyema and after surgery my xray showing some recovery.",
"I was advised to put honey on a burned skin.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+----------------------------------------------------------------------------------+-------+
|text |result |
+----------------------------------------------------------------------------------+-------+
|I had a lung surgery for emphyema and after surgery my xray showing some recovery.|[True] |
|I was advised to put honey on a burned skin. |[False]|
+----------------------------------------------------------------------------------+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_vop_sound_medical|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
In-house annotated health-related text in colloquial language.
## Sample text from the training dataset
“Hello,I’m 20 year old girl. I’m diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I’m taking weekly supplement of vitamin D and 1000 mcg b12 daily. I’m taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I’m facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.”
## Benchmarking
```bash
label precision recall f1-score support
False 0.848564 0.752315 0.797546 432
True 0.664577 0.785185 0.719864 270
accuracy - - 0.764957 702
macro_avg 0.756570 0.768750 0.758705 702
weighted_avg 0.777800 0.764957 0.767668 702
```
---
layout: model
title: English Bert Embeddings (Base, Uncased, Unstructured)
author: John Snow Labs
name: bert_embeddings_bert_base_uncased_sparse_70_unstructured
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-uncased-sparse-70-unstructured` is an English model originally trained by `Intel`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_uncased_sparse_70_unstructured_en_3.4.2_3.0_1649672494464.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_uncased_sparse_70_unstructured_en_3.4.2_3.0_1649672494464.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_uncased_sparse_70_unstructured","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_uncased_sparse_70_unstructured","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.bert_base_uncased_sparse_70_unstructured").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_uncased_sparse_70_unstructured|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|228.7 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Intel/bert-base-uncased-sparse-70-unstructured
---
layout: model
title: Word2Vec Embeddings in Russian (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, ru, open_source]
task: Embeddings
language: ru
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ru_3.4.1_3.0_1647455083959.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ru_3.4.1_3.0_1647455083959.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ru") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Я люблю искра NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ru")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Я люблю искра NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ru.embed.w2v_cc_300d").predict("""Я люблю искра NLP""")
```
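The lookup behavior of this annotator can be illustrated with a minimal sketch in plain Python (not the Spark NLP API; the `lookup` table and `embed` helper below are hypothetical). Each token is mapped to its stored vector, and out-of-vocabulary tokens fall back to a zero vector:

```python
# Toy embedding table: token -> fixed-size vector
lookup = {
    "люблю": [0.2, 0.7, 0.1],
    "NLP": [0.9, 0.1, 0.3],
}
DIM = 3

def embed(tokens):
    """Map each token to its vector; unknown tokens get a zero vector."""
    return [lookup.get(t, [0.0] * DIM) for t in tokens]

vectors = embed(["Я", "люблю", "NLP"])
```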
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|ru|
|Size:|1.3 GB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Legal Asia And Oceania Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_asia_and_oceania_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, asia_and_oceania, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the `legclf_asia_and_oceania_bert` model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the class `Asia_and_Oceania` or not (Binary Classification) according to EuroVoc labels.
## Predicted Entities
`Asia_and_Oceania`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_asia_and_oceania_bert_en_1.0.0_3.0_1678111638726.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_asia_and_oceania_bert_en_1.0.0_3.0_1678111638726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+------------------+
|result            |
+------------------+
|[Asia_and_Oceania]|
|[Other]           |
|[Other]           |
|[Asia_and_Oceania]|
+------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_asia_and_oceania_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.8 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Asia_and_Oceania 0.92 0.91 0.92 456
Other 0.90 0.92 0.91 400
accuracy - - 0.91 856
macro-avg 0.91 0.91 0.91 856
weighted-avg 0.91 0.91 0.91 856
```
---
layout: model
title: Hindi RoBERTa Embeddings (from neuralspace-reverie)
author: John Snow Labs
name: roberta_embeddings_indic_transformers_hi_roberta
date: 2022-04-14
tags: [roberta, embeddings, hi, open_source]
task: Embeddings
language: hi
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indic-transformers-hi-roberta` is a Hindi model originally trained by `neuralspace-reverie`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_hi_roberta_hi_3.4.2_3.0_1649947526435.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_hi_roberta_hi_3.4.2_3.0_1649947526435.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers_hi_roberta","hi") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["मुझे स्पार्क एनएलपी पसंद है"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers_hi_roberta","hi")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("मुझे स्पार्क एनएलपी पसंद है").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("hi.embed.indic_transformers_hi_roberta").predict("""मुझे स्पार्क एनएलपी पसंद है""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_indic_transformers_hi_roberta|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|hi|
|Size:|313.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/neuralspace-reverie/indic-transformers-hi-roberta
- https://oscar-corpus.com/
---
layout: model
title: Dutch BERT Sentence Base Cased Embedding
author: John Snow Labs
name: sent_bert_base_cased
date: 2021-09-06
tags: [dutch, open_source, bert_sentence_embeddings, cased, nl]
task: Embeddings
language: nl
edition: Spark NLP 3.2.2
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
BERTje is a Dutch pre-trained BERT model developed at the University of Groningen.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_base_cased_nl_3.2.2_3.0_1630926264607.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_base_cased_nl_3.2.2_3.0_1630926264607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "nl") \
.setInputCols("sentence") \
.setOutputCol("bert_sentence")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings ])
```
```scala
val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "nl")
.setInputCols("sentence")
.setOutputCol("bert_sentence")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings ))
```
{:.nlu-block}
```python
import nlu
nlu.load("nl.embed_sentence.bert.base_cased").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_bert_base_cased|
|Compatibility:|Spark NLP 3.2.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[bert_sentence]|
|Language:|nl|
|Case sensitive:|true|
## Data Source
The model is imported from: https://huggingface.co/GroNLP/bert-base-dutch-cased
---
layout: model
title: English DebertaForQuestionAnswering model (from nbroad)
author: John Snow Labs
name: deberta_v3_xsmall_qa_squad2
date: 2022-06-15
tags: [open_source, deberta, question_answering, en]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DeBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deberta-v3-xsmall-squad2` is an English model originally trained by `nbroad`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_v3_xsmall_qa_squad2_en_4.0.0_3.0_1655290640197.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_v3_xsmall_qa_squad2_en_4.0.0_3.0_1655290640197.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DebertaForQuestionAnswering.pretrained("deberta_v3_xsmall_qa_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DebertaForQuestionAnswering.pretrained("deberta_v3_xsmall_qa_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.deberta").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|deberta_v3_xsmall_qa_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|252.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
https://huggingface.co/nbroad/deberta-v3-xsmall-squad2
---
layout: model
title: Relation Extraction between dates and other entities (ReDL)
author: John Snow Labs
name: redl_oncology_temporal_biobert_wip
date: 2022-09-29
tags: [licensed, clinical, oncology, en, relation_extraction, temporal]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.1.0
spark_version: 3.0
supported: true
annotator: RelationExtractionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This relation extraction model links Date and Relative_Date extractions to clinical entities such as Test or Cancer_Dx.
## Predicted Entities
`is_date_of`, `O`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_temporal_biobert_wip_en_4.1.0_3.0_1664456191667.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_temporal_biobert_wip_en_4.1.0_3.0_1664456191667.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Each relevant relation pair in the pipeline should include one date entity (Date or Relative_Date) and a clinical entity (such as Pathology_Test, Cancer_Dx or Chemotherapy).
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos_tags")
dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \
.setInputCols(["sentence", "pos_tags", "token"]) \
.setOutputCol("dependencies")
re_ner_chunk_filter = RENerChunksFilter()\
.setInputCols(["ner_chunk", "dependencies"])\
.setOutputCol("re_ner_chunk")\
.setMaxSyntacticDistance(10)\
.setRelationPairs(["Cancer_Dx-Date", "Date-Cancer_Dx", "Relative_Date-Cancer_Dx", "Cancer_Dx-Relative_Date", "Cancer_Surgery-Date", "Date-Cancer_Surgery", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery"])
re_model = RelationExtractionDLModel.pretrained("redl_oncology_temporal_biobert_wip", "en", "clinical/models")\
.setInputCols(["re_ner_chunk", "sentence"])\
.setOutputCol("relation_extraction")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
pos_tagger,
dependency_parser,
re_ner_chunk_filter,
re_model])
data = spark.createDataFrame([["Her breast cancer was diagnosed last year."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos_tags")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentence", "pos_tags", "token"))
.setOutputCol("dependencies")
val re_ner_chunk_filter = new RENerChunksFilter()
.setInputCols("ner_chunk", "dependencies")
.setOutputCol("re_ner_chunk")
.setMaxSyntacticDistance(10)
.setRelationPairs(Array("Cancer_Dx-Date", "Date-Cancer_Dx", "Relative_Date-Cancer_Dx", "Cancer_Dx-Relative_Date", "Cancer_Surgery-Date", "Date-Cancer_Surgery", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery"))
val re_model = RelationExtractionDLModel.pretrained("redl_oncology_temporal_biobert_wip", "en", "clinical/models")
.setPredictionThreshold(0.5f)
.setInputCols("re_ner_chunk", "sentence")
.setOutputCol("relation_extraction")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
pos_tagger,
dependency_parser,
re_ner_chunk_filter,
re_model))
val data = Seq("Her breast cancer was diagnosed last year.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
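The pair-filtering step that `RENerChunksFilter` performs with `setRelationPairs` can be illustrated with a plain-Python sketch (the `candidate_pairs` helper and toy chunks are hypothetical, not the Spark NLP API): only chunk pairs whose entity types match an allowed relation pair are passed on to the relation extraction model.

```python
from itertools import combinations

# Allowed entity-type pairs, mirroring the setRelationPairs argument above
ALLOWED = {("Cancer_Dx", "Date"), ("Date", "Cancer_Dx"),
           ("Relative_Date", "Cancer_Dx"), ("Cancer_Dx", "Relative_Date")}

def candidate_pairs(chunks):
    """chunks: list of (text, entity_type) tuples produced by NER.
    Keep only pairs whose entity types appear in the allowed list,
    in either order."""
    return [(a, b) for a, b in combinations(chunks, 2)
            if (a[1], b[1]) in ALLOWED or (b[1], a[1]) in ALLOWED]

chunks = [("breast cancer", "Cancer_Dx"),
          ("last year", "Relative_Date"),
          ("chemotherapy", "Chemotherapy")]
pairs = candidate_pairs(chunks)
```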
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.oncology_temporal_biobert_wip").predict("""Her breast cancer was diagnosed last year.""")
```
## Results
```bash
| chunk1 | entity1 | chunk2 | entity2 | relation | confidence |
| --------------- |--------------- |---------------- |--------------- |----------- |----------- |
| breast cancer | Cancer_Dx | last year | Relative_Date | is_date_of | 0.9999256 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|redl_oncology_temporal_biobert_wip|
|Compatibility:|Healthcare NLP 4.1.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|405.4 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label recall precision f1 support
O 0.77 0.81 0.79 302.0
is_date_of 0.82 0.78 0.80 298.0
macro-avg 0.79 0.79 0.79 NaN
```
---
layout: model
title: Dutch RoBERTa Embeddings (Merged)
author: John Snow Labs
name: roberta_embeddings_robbertje_1_gb_merged
date: 2022-04-14
tags: [roberta, embeddings, nl, open_source]
task: Embeddings
language: nl
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `robbertje-1-gb-merged` is a Dutch model originally trained by `DTAI-KULeuven`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robbertje_1_gb_merged_nl_3.4.2_3.0_1649949144654.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robbertje_1_gb_merged_nl_3.4.2_3.0_1649949144654.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robbertje_1_gb_merged","nl") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ik hou van Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robbertje_1_gb_merged","nl")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ik hou van Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("nl.embed.robbertje_1_gb_merged").predict("""Ik hou van Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_robbertje_1_gb_merged|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|nl|
|Size:|279.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/DTAI-KULeuven/robbertje-1-gb-merged
- http://github.com/iPieter/robbert
- http://github.com/iPieter/robbertje
- https://www.clinjournal.org/clinj/article/view/131
- https://www.clin31.ugent.be
- https://arxiv.org/abs/2101.05716
---
layout: model
title: Part of Speech for Vietnamese
author: John Snow Labs
name: pos_vtb
date: 2021-03-10
tags: [open_source, pos, vi]
supported: true
task: Part of Speech Tagging
language: vi
edition: Spark NLP 2.7.5
spark_version: 2.4
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron` architecture.
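The Spark NLP annotator is a full production implementation; as a toy plain-Python sketch of the averaged-perceptron idea (the class, feature names, and tiny training set below are invented for illustration), the core update and weight-averaging steps look like this:

```python
from collections import defaultdict

class AveragedPerceptron:
    """Toy averaged perceptron: the final model uses the mean of the
    weight vector over all training steps, which damps oscillation."""

    def __init__(self, tags):
        self.tags = tags
        self.w = defaultdict(float)       # (feature, tag) -> current weight
        self.totals = defaultdict(float)  # (feature, tag) -> weight summed over steps
        self.steps = 0

    def predict(self, feats, weights=None):
        w = self.w if weights is None else weights
        # Ties are broken alphabetically so the result is deterministic.
        return max(self.tags, key=lambda t: (sum(w[(f, t)] for f in feats), t))

    def train_step(self, feats, gold):
        guess = self.predict(feats)
        if guess != gold:  # standard perceptron update on a mistake
            for f in feats:
                self.w[(f, gold)] += 1.0
                self.w[(f, guess)] -= 1.0
        self.steps += 1
        for key, value in self.w.items():
            self.totals[key] += value

    def averaged_weights(self):
        return defaultdict(float, {k: v / self.steps for k, v in self.totals.items()})

tagger = AveragedPerceptron(["ADJ", "NOUN", "VERB"])
examples = [(["word=run", "prev_tag=PRON"], "VERB"),
            (["word=dog", "suffix=og"], "NOUN")]
for _ in range(5):
    for feats, gold in examples:
        tagger.train_step(feats, gold)

avg = tagger.averaged_weights()
print(tagger.predict(["word=dog", "suffix=og"], avg))  # → NOUN
```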
## Predicted Entities
- ADJ
- ADP
- ADV
- AUX
- CCONJ
- DET
- INTJ
- NOUN
- NUM
- PART
- PRON
- PROPN
- PUNCT
- SCONJ
- VERB
- X
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_vtb_vi_2.7.5_2.4_1615401332222.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_vtb_vi_2.7.5_2.4_1615401332222.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_vtb", "vi") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['Thắng sẽ tìm nghề mới cho Lan .']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_vtb", "vi")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector,tokenizer, pos))
val data = Seq("Thắng sẽ tìm nghề mới cho Lan .").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["Thắng sẽ tìm nghề mới cho Lan ."]
token_df = nlu.load('vi.pos.vtb').predict(text)
token_df
```
## Results
```bash
+-------------------------------+--------------------------------------------+
|text |result |
+-------------------------------+--------------------------------------------+
|Thắng sẽ tìm nghề mới cho Lan .|[NOUN, X, VERB, NOUN, ADJ, ADP, NOUN, PUNCT]|
+-------------------------------+--------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_vtb|
|Compatibility:|Spark NLP 2.7.5+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[pos]|
|Language:|vi|
## Data Source
The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) data set.
## Benchmarking
```bash
| | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| ADJ | 0.58 | 0.49 | 0.53 | 738 |
| ADP | 0.84 | 0.87 | 0.86 | 688 |
| AUX | 0.79 | 0.95 | 0.87 | 132 |
| CCONJ | 0.85 | 0.80 | 0.83 | 335 |
| DET | 0.95 | 0.85 | 0.90 | 232 |
| INTJ | 1.00 | 0.14 | 0.25 | 7 |
| NOUN | 0.84 | 0.86 | 0.85 | 3838 |
| NUM | 0.94 | 0.91 | 0.92 | 412 |
| PART | 0.53 | 0.30 | 0.38 | 87 |
| PROPN | 0.85 | 0.85 | 0.85 | 494 |
| PUNCT | 0.97 | 0.99 | 0.98 | 1722 |
| SCONJ | 0.99 | 0.98 | 0.98 | 122 |
| VERB | 0.73 | 0.76 | 0.74 | 2178 |
| X | 0.81 | 0.76 | 0.79 | 970 |
| accuracy | | | 0.83 | 11955 |
| macro avg | 0.83 | 0.75 | 0.77 | 11955 |
| weighted avg | 0.83 | 0.83 | 0.83 | 11955 |
```
---
layout: model
title: English DistilBertForQuestionAnswering Base Cased model (from autoevaluate)
author: John Snow Labs
name: distilbert_qa_autoevaluate_base_cased_led_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad` is an English model originally trained by `autoevaluate`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_autoevaluate_base_cased_led_squad_en_4.3.0_3.0_1672766463212.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_autoevaluate_base_cased_led_squad_en_4.3.0_3.0_1672766463212.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_autoevaluate_base_cased_led_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_autoevaluate_base_cased_led_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_autoevaluate_base_cased_led_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/autoevaluate/distilbert-base-cased-distilled-squad
---
layout: model
title: Legal Entire Agreements Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_entire_agreements_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, entire_agreements, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Entire_Agreements` clause type. To use it, make sure you provide enough context as input. Adding a sentence splitter to the pipeline would make the model see only sentences rather than the whole text, so it is better to skip it, unless you want to do binary classification at sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
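As a plain-Python sketch of that splitting advice (the function name and the blank-line paragraph heuristic are assumptions for illustration, not the tutorial's code), one could pre-chunk a long document so each piece stays under the 512-token limit:

```python
# Split a long document into paragraphs (blank-line boundaries), then
# greedily pack paragraphs into chunks of at most `max_tokens`
# whitespace-separated tokens, so each chunk fits the embedding limit.
def split_for_classifier(text: str, max_tokens: int = 512):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = ("word " * 300).strip() + "\n\n" + ("word " * 300).strip()
print([len(c.split()) for c in split_for_classifier(doc)])  # → [300, 300]
```

Each resulting chunk can then be fed to the classifier as a separate row.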
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
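A minimal sketch of that combination step (the helper and the keyword-based stand-in classifiers are hypothetical, used only to show the True/False aggregation, not real Models Hub classifiers):

```python
# Combine several binary clause classifiers into one flag per clause type.
def combine_clause_flags(paragraph, classifiers):
    # `classifiers` maps clause name -> callable(paragraph) -> predicted label,
    # where each classifier answers either its own clause name or "Other".
    return {name: clf(paragraph) == name for name, clf in classifiers.items()}

classifiers = {
    "Entire_Agreements": lambda p: "Entire_Agreements" if "entire agreement" in p.lower() else "Other",
    "Governing_Laws": lambda p: "Governing_Laws" if "governed by" in p.lower() else "Other",
}
paragraph = "This Agreement constitutes the entire agreement between the parties."
print(combine_clause_flags(paragraph, classifiers))
# → {'Entire_Agreements': True, 'Governing_Laws': False}
```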
## Predicted Entities
`Entire_Agreements`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_entire_agreements_bert_en_1.0.0_3.0_1678050004746.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_entire_agreements_bert_en_1.0.0_3.0_1678050004746.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------------------+
|result             |
+-------------------+
|[Entire_Agreements]|
|[Other]            |
|[Other]            |
|[Entire_Agreements]|
+-------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_entire_agreements_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Entire_Agreements 0.99 0.98 0.98 284
Other 0.98 0.99 0.98 312
accuracy - - 0.98 596
macro-avg 0.98 0.98 0.98 596
weighted-avg 0.98 0.98 0.98 596
```
---
layout: model
title: Detect clinical events
author: John Snow Labs
name: ner_events_healthcare
date: 2021-04-01
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Detect clinical events like Date, Occurrence, Clinical_Department, and many more using this pretrained NER model.
## Predicted Entities
`OCCURRENCE`, `TREATMENT`, `TIME`, `DATE`, `PROBLEM`, `CLINICAL_DEPT`, `DURATION`, `EVIDENTIAL`, `FREQUENCY`, `TEST`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_EVENTS_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_healthcare_en_3.0.0_3.0_1617260839291.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_healthcare_en_3.0.0_3.0_1617260839291.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_events_healthcare", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_events_healthcare", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))
val data = Seq("EXAMPLE_TEXT").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.events_healthcare").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_events_healthcare|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Benchmarking
```bash
entity tp fp fn total precision recall f1
DURATION 575.0 263.0 231.0 806.0 0.6862 0.7134 0.6995
PROBLEM 8067.0 2479.0 2305.0 10372.0 0.7649 0.7778 0.7713
DATE 1787.0 508.0 315.0 2102.0 0.7786 0.8501 0.8128
CLINICAL_DEPT 1804.0 393.0 338.0 2142.0 0.8211 0.8422 0.8315
OCCURRENCE 1917.0 893.0 2188.0 4105.0 0.6822 0.467 0.5544
TREATMENT 4578.0 1596.0 1817.0 6395.0 0.7415 0.7159 0.7285
FREQUENCY 145.0 46.0 213.0 358.0 0.7592 0.405 0.5282
TEST 3723.0 949.0 1113.0 4836.0 0.7969 0.7699 0.7831
EVIDENTIAL 334.0 80.0 279.0 613.0 0.8068 0.5449 0.6504
macro - - - - - - 0.60759
micro - - - - - - 0.73065
```
---
layout: model
title: Detect Clinical Entities (jsl_ner_wip_greedy_clinical)
author: John Snow Labs
name: jsl_ner_wip_greedy_clinical
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for clinical terminology. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
## Predicted Entities
`Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Hyperlipidemia`, `Respiration`, `Birth_Entity`, `Age`, `Family_History_Header`, `Labour_Delivery`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Drug`, `Symptom`, `Treatment`, `Substance`, `Route`, `Blood_Pressure`, `Diet`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Time`, `Frequency`, `Sexually_Active_or_Sexual_Orientation`, `Weight`, `Vaccine`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Hypertension`, `HDL`, `Overweight`, `Total_Cholesterol`, `Smoking`, `Date`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_clinical_en_3.0.0_3.0_1617206898504.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_clinical_en_3.0.0_3.0_1617206898504.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
jsl_ner = MedicalNerModel.pretrained("jsl_ner_wip_greedy_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("jsl_ner")
jsl_ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "jsl_ner"]) \
.setOutputCol("ner_chunk")
jsl_ner_pipeline = Pipeline().setStages([
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
jsl_ner,
jsl_ner_converter])
jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
data = spark.createDataFrame([["""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."""]]).toDF("text")
result = jsl_ner_model.transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val jsl_ner = MedicalNerModel.pretrained("jsl_ner_wip_greedy_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("jsl_ner")
val jsl_ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "jsl_ner"))
.setOutputCol("ner_chunk")
val jsl_ner_pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
jsl_ner,
jsl_ner_converter))
val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text")
val result = jsl_ner_pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.jsl.wip.clinical.greedy").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
---
layout: model
title: English Bert Embeddings (from anferico)
author: John Snow Labs
name: bert_embeddings_for_patents
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-for-patents` is an English model originally trained by `anferico`.
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_for_patents","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_for_patents","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_for_patents|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
## References
- https://huggingface.co/anferico/bert-for-patents
- https://cloud.google.com/blog/products/ai-machine-learning/how-ai-improves-patent-analysis
- https://services.google.com/fh/files/blogs/bert_for_patents_white_paper.pdf
- https://github.com/google/patents-public-data/blob/master/models/BERT%20for%20Patents.md
- https://github.com/ec-jrc/Patents4IPPC
- https://picampus-school.com/
- https://ec.europa.eu/jrc/en
---
layout: model
title: Italian Bert Embeddings (from bullmount)
author: John Snow Labs
name: bert_embeddings_hseBert_it_cased
date: 2022-04-11
tags: [bert, embeddings, it, open_source]
task: Embeddings
language: it
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `hseBert-it-cased` is an Italian model originally trained by `bullmount`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_hseBert_it_cased_it_3.4.2_3.0_1649676875956.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_hseBert_it_cased_it_3.4.2_3.0_1649676875956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_hseBert_it_cased","it") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_hseBert_it_cased","it")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Adoro Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("it.embed.hseBert_it_cased").predict("""Adoro Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_hseBert_it_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|it|
|Size:|412.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/bullmount/hseBert-it-cased
---
layout: model
title: Castilian, Spanish BertForQuestionAnswering model (from CenIA)
author: John Snow Labs
name: bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_tar
date: 2022-06-02
tags: [open_source, question_answering, bert]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-cased-finetuned-qa-tar` is a Castilian Spanish model originally trained by `CenIA`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_tar_es_4.0.0_3.0_1654180441819.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_tar_es_4.0.0_3.0_1654180441819.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_tar","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_tar","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.bert.base_cased.by_CenIA").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_tar|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|410.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/CenIA/bert-base-spanish-wwm-cased-finetuned-qa-tar
---
layout: model
title: Fast Neural Machine Translation Model from Niuean to English
author: John Snow Labs
name: opus_mt_niu_en
date: 2020-12-29
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, niu, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `niu`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_niu_en_xx_2.7.0_2.4_1609254471115.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_niu_en_xx_2.7.0_2.4_1609254471115.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_niu_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_niu_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.niu.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_niu_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Sentence Entity Resolver for SNOMED (sbiobertresolve_snomed_drug)
author: John Snow Labs
name: sbiobertresolve_snomed_drug
date: 2022-01-18
tags: [licensed, snomed, clinical, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.4
spark_version: 2.4
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps detected drug entities to SNOMED codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings.
## Predicted Entities
`SNOMED Codes`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_drug_en_3.3.4_2.4_1642534694043.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_drug_en_3.3.4_2.4_1642534694043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
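The usage example is missing from this card. The sketch below follows the standard Healthcare NLP sentence-resolver pattern and assumes a licensed `sparknlp_jsl` installation with an active `spark` session; for brevity it feeds drug text straight from a `DocumentAssembler` (in practice the chunks would come from the `ner_posology` model listed under Dependencies).

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import BertSentenceEmbeddings
from sparknlp_jsl.annotator import SentenceEntityResolverModel

# Treat the incoming text as the chunk to resolve
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("ner_chunk")

# Same sentence embeddings the resolver was trained with
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("sbert_embeddings")

snomed_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_drug", "en", "clinical/models") \
    .setInputCols(["ner_chunk", "sbert_embeddings"]) \
    .setOutputCol("snomed_code") \
    .setDistanceFunction("EUCLIDEAN")

pipeline = Pipeline(stages=[document_assembler, sbert_embedder, snomed_resolver])
model = LightPipeline(pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = model.fullAnnotate("aspirin")
```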
## Results
```bash
+-----------------+------+-----------------+-----------------+------------------------------------------------------------+------------------------------------------------------------+
| ner_chunk|entity| snomed_code| resolved_text| all_k_results| all_k_resolutions|
+-----------------+------+-----------------+-----------------+------------------------------------------------------------+------------------------------------------------------------+
| Fragmin| DRUG| 9487801000001106| Fragmin|9487801000001106:::130752006:::28999000:::953500100000110...|Fragmin:::Fragilysin:::Fusarin:::Femulen:::Fumonisin:::Fr...|
| OxyContin| DRUG| 9296001000001100| OxyCONTIN|9296001000001100:::373470001:::230091000001108:::55452001...|OxyCONTIN:::Oxychlorosene:::Oxyargin:::oxyCODONE:::Oxymor...|
| folic acid| DRUG| 63718003| Folic acid|63718003:::6247001:::226316008:::432165000:::438451000124...|Folic acid:::Folic acid-containing product:::Folic acid s...|
| levothyroxine| DRUG|10071011000001106| Levothyroxine|10071011000001106:::710809001:::768532006:::126202002:::7...|Levothyroxine:::Levothyroxine (substance):::Levothyroxine...|
| Avandia| DRUG| 9217601000001109| avandia|9217601000001109:::9217501000001105:::12226401000001108::...|avandia:::avandamet:::Anatera:::Intanza:::Avamys:::Aragam...|
| aspirin| DRUG| 387458008| Aspirin|387458008:::7947003:::5145711000001107:::426365001:::4125...|Aspirin:::Aspirin-containing product:::Aspirin powder:::A...|
| Neurontin| DRUG| 9461401000001102| neurontin|9461401000001102:::130694004:::86822004:::952840100000110...|neurontin:::Neurolysin:::Neurine (substance):::Nebilet:::...|
|magnesium citrate| DRUG| 12495006|Magnesium citrate|12495006:::387401007:::21691008:::15531411000001106:::408...|Magnesium citrate:::Magnesium carbonate:::Magnesium trisi...|
| insulin| DRUG| 67866001| Insulin|67866001:::325072002:::414515005:::39487003:::411530000::...|Insulin:::Insulin aspart:::Insulin detemir:::Insulin-cont...|
+-----------------+------+-----------------+-----------------+------------------------------------------------------------+------------------------------------------------------------+
```
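The `all_k_results` and `all_k_resolutions` columns above pack the ranked candidates into `:::`-separated strings (truncated here with `...`). A small helper (hypothetical, not part of Spark NLP) can unpack them into ranked (code, resolution) pairs:

```python
def unpack_ranked(all_k_results: str, all_k_resolutions: str):
    """Split the ':::'-delimited candidate strings emitted by the
    resolver into a ranked list of (code, resolution) pairs."""
    codes = [c.strip() for c in all_k_results.split(":::")]
    texts = [t.strip() for t in all_k_resolutions.split(":::")]
    return list(zip(codes, texts))

# First two candidates from the 'folic acid' row above
pairs = unpack_ranked("63718003:::6247001",
                      "Folic acid:::Folic acid-containing product")
print(pairs[0])  # ('63718003', 'Folic acid')
```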
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_snomed_drug|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[snomed_code]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|false|
|Dependencies:|ner_posology|
## Data Source
Trained on `SNOMED` code dataset with `sbiobert_base_cased_mli` sentence embeddings.
---
layout: model
title: Pipeline to Mapping SNOMED Codes with Their Corresponding ICDO Codes
author: John Snow Labs
name: snomed_icdo_mapping
date: 2023-06-13
tags: [en, licensed, clinical, pipeline, chunk_mapping, snomed, icdo]
task: Chunk Mapping
language: en
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the `snomed_icdo_mapper` model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_icdo_mapping_en_4.4.4_3.2_1686665539427.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_icdo_mapping_en_4.4.4_3.2_1686665539427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("snomed_icdo_mapping", "en", "clinical/models")
result = pipeline.fullAnnotate("10376009 2026006 26638004")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("snomed_icdo_mapping", "en", "clinical/models")
val result = pipeline.fullAnnotate("10376009 2026006 26638004")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.snomed_to_icdo.pipe").predict("""Put your text here.""")
```
## Results
```bash
|    | snomed_code                    | icdo_code                |
|---:|:-------------------------------|:-------------------------|
|  0 | 10376009 | 2026006 | 26638004 | 8050/2 | 9014/0 | 8322/0 |
```
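In the flattened row above, the input SNOMED codes and their mapped ICD-O codes are pipe-separated within a single cell, in matching order. A small helper (hypothetical, not part of the pipeline) pairs them up:

```python
def pair_codes(snomed_cell: str, icdo_cell: str) -> dict:
    """Pair pipe-separated SNOMED codes with their mapped ICD-O codes."""
    snomed = [c.strip() for c in snomed_cell.split("|")]
    icdo = [c.strip() for c in icdo_cell.split("|")]
    return dict(zip(snomed, icdo))

mapping = pair_codes("10376009 | 2026006 | 26638004", "8050/2 | 9014/0 | 8322/0")
print(mapping["10376009"])  # 8050/2
```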
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|snomed_icdo_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|212.8 KB|
## Included Models
- DocumentAssembler
- TokenizerModel
- ChunkMapperModel
---
layout: model
title: English BertForQuestionAnswering Cased model (from chanifrusydi)
author: John Snow Labs
name: bert_qa_chanifrusydi_finetuned_squad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `chanifrusydi`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_chanifrusydi_finetuned_squad_en_4.0.0_3.0_1657186373688.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_chanifrusydi_finetuned_squad_en_4.0.0_3.0_1657186373688.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_chanifrusydi_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_chanifrusydi_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_chanifrusydi_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/chanifrusydi/bert-finetuned-squad
---
layout: model
title: Smaller BERT Sentence Embeddings (L-2_H-512_A-8)
author: John Snow Labs
name: sent_small_bert_L2_512
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L2_512_en_2.6.0_2.4_1598350526043.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L2_512_en_2.6.0_2.4_1598350526043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_512", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_512", "en")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.small_bert_L2_512').predict(text, output_level='sentence')
embeddings_df
```
{:.h2_title}
## Results
```bash
sentence en_embed_sentence_small_bert_L2_512_embeddings
I hate cancer [0.015892572700977325, 0.21051561832427979, 0....
Antibiotics aren't painkiller [-0.2904765009880066, 0.21515187621116638, 0.1...
```
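Downstream tasks typically compare these 512-dimensional sentence vectors by cosine similarity; a minimal sketch (plain Python, no Spark required, using toy 3-d vectors in place of the real embeddings above):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine([1.0, 0.0, 0.0], [1.0, 1.0, 0.0]), 4))  # 0.7071
```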
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_small_bert_L2_512|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|en|
|Dimension:|512|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1
---
layout: model
title: Sentence Detection in Sindhi Text
author: John Snow Labs
name: sentence_detector_dl
date: 2021-08-30
tags: [sd, open_source, sentence_detection]
task: Sentence Detection
language: sd
edition: Spark NLP 3.2.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_sd_3.2.0_3.0_1630337452693.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_sd_3.2.0_3.0_1630337452693.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel\
.pretrained("sentence_detector_dl", "sd") \
.setInputCols(["document"]) \
.setOutputCol("sentences")
sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL]))
sd_model.fullAnnotate("""readingولي رھيا آھن ھڪڙو وڏو ذريعو انگريزي پڙھڻ جا پيراگراف؟ توھان صحيح ھن place تي آيا آھيو. هڪ تازي تحقيق مطابق ا today's جي نوجوانن ۾ پڙهڻ جي عادت تيزيءَ سان گهٽجي رهي آهي. اھي نٿا ڏئي سگھن انگريزي ڏنل پيراگراف تي ڪجھ سيڪنڊن کان و forيڪ لاءِ. پڻ ، پڙهڻ هو ۽ آهي هڪ لازمي حصو س allني مقابلي واري امتحانن جو. تنھنڪري ، توھان پنھنجي پڙھڻ جي صلاحيتن کي ڪيئن بھتر ڪريو ٿا؟ ھن سوال جو جواب اصل ۾ ھڪڙو questionيو سوال آھي: پڙھڻ جي صلاحيتن جو استعمال ا آھي؟ پڙهڻ جو بنيادي مقصد آهي ’احساس ڪرڻ‘.""")
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "sd")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val pipeline = new Pipeline().setStages(Array(documenter, model))
val data = Seq("readingولي رھيا آھن ھڪڙو وڏو ذريعو انگريزي پڙھڻ جا پيراگراف؟ توھان صحيح ھن place تي آيا آھيو. هڪ تازي تحقيق مطابق ا today's جي نوجوانن ۾ پڙهڻ جي عادت تيزيءَ سان گهٽجي رهي آهي. اھي نٿا ڏئي سگھن انگريزي ڏنل پيراگراف تي ڪجھ سيڪنڊن کان و forيڪ لاءِ. پڻ ، پڙهڻ هو ۽ آهي هڪ لازمي حصو س allني مقابلي واري امتحانن جو. تنھنڪري ، توھان پنھنجي پڙھڻ جي صلاحيتن کي ڪيئن بھتر ڪريو ٿا؟ ھن سوال جو جواب اصل ۾ ھڪڙو questionيو سوال آھي: پڙھڻ جي صلاحيتن جو استعمال ا آھي؟ پڙهڻ جو بنيادي مقصد آهي ’احساس ڪرڻ‘.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load('sd.sentence_detector').predict("readingولي رھيا آھن ھڪڙو وڏو ذريعو انگريزي پڙھڻ جا پيراگراف؟ توھان صحيح ھن place تي آيا آھيو. هڪ تازي تحقيق مطابق ا today's جي نوجوانن ۾ پڙهڻ جي عادت تيزيءَ سان گهٽجي رهي آهي. اھي نٿا ڏئي سگھن انگريزي ڏنل پيراگراف تي ڪجھ سيڪنڊن کان و forيڪ لاءِ. پڻ ، پڙهڻ هو ۽ آهي هڪ لازمي حصو س allني مقابلي واري امتحانن جو. تنھنڪري ، توھان پنھنجي پڙھڻ جي صلاحيتن کي ڪيئن بھتر ڪريو ٿا؟ ھن سوال جو جواب اصل ۾ ھڪڙو questionيو سوال آھي: پڙھڻ جي صلاحيتن جو استعمال ا آھي؟ پڙهڻ جو بنيادي مقصد آهي ’احساس ڪرڻ‘.", output_level ='sentence')
```
## Results
```bash
+--------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------+
|[readingولي رھيا آھن ھڪڙو وڏو ذريعو انگريزي پڙھڻ جا پيراگراف؟ توھان صحيح ھن place تي آيا آھيو.] |
|[هڪ تازي تحقيق مطابق ا today's جي نوجوانن ۾ پڙهڻ جي عادت تيزيءَ سان گهٽجي رهي آهي.] |
|[اھي نٿا ڏئي سگھن انگريزي ڏنل پيراگراف تي ڪجھ سيڪنڊن کان و forيڪ لاءِ.] |
|[پڻ ، پڙهڻ هو ۽ آهي هڪ لازمي حصو س allني مقابلي واري امتحانن جو.] |
|[تنھنڪري ، توھان پنھنجي پڙھڻ جي صلاحيتن کي ڪيئن بھتر ڪريو ٿا؟ ھن سوال جو جواب اصل ۾ ھڪڙو questionيو سوال آھي:]|
|[پڙھڻ جي صلاحيتن جو استعمال ا آھي؟ پڙهڻ جو بنيادي مقصد آهي ’احساس ڪرڻ‘.] |
+--------------------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sentence_detector_dl|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[sentences]|
|Language:|sd|
## Benchmarking
```bash
label Accuracy Recall Prec F1
0 0.98 1.00 0.96 0.98
```
---
layout: model
title: English DistilBertForQuestionAnswering model (from avioo1)
author: John Snow Labs
name: distilbert_qa_avioo1_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `avioo1`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_avioo1_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725061133.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_avioo1_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725061133.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_avioo1_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_avioo1_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_avioo1").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_avioo1_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/avioo1/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Telugu Bert Embeddings (from monsoon-nlp)
author: John Snow Labs
name: bert_embeddings_muril_adapted_local
date: 2022-04-11
tags: [bert, embeddings, te, open_source]
task: Embeddings
language: te
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `muril-adapted-local` is a Telugu model originally trained by `monsoon-nlp`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_te_3.4.2_3.0_1649675347372.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_te_3.4.2_3.0_1649675347372.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","te") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","te")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("te.embed.muril_adapted_local").predict("""నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_muril_adapted_local|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|te|
|Size:|888.7 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/monsoon-nlp/muril-adapted-local
- https://tfhub.dev/google/MuRIL/1
---
layout: model
title: Swahili XLMRobertaForTokenClassification Base Cased model (from mbeukman)
author: John Snow Labs
name: xlmroberta_ner_base_finetuned_hausa_finetuned_ner_swahili
date: 2022-08-01
tags: [sw, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: sw
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-hausa-finetuned-ner-swahili` is a Swahili model originally trained by `mbeukman`.
## Predicted Entities
`PER`, `LOC`, `ORG`, `DATE`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_hausa_finetuned_ner_swahili_sw_4.1.0_3.0_1659353949038.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_hausa_finetuned_ner_swahili_sw_4.1.0_3.0_1659353949038.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_hausa_finetuned_ner_swahili","sw") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_hausa_finetuned_ner_swahili","sw")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_finetuned_hausa_finetuned_ner_swahili|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|sw|
|Size:|1.0 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-hausa-finetuned-ner-swahili
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://github.com/masakhane-io/masakhane-ner
---
layout: model
title: Legal Subscription Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_subscription_agreement_bert
date: 2022-11-25
tags: [en, legal, classification, agreement, subscription, licensed, bert]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_subscription_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `subscription-agreement` or not (Binary Classification).
Unlike the Longformer model, this model is lighter and faster at inference time.
## Predicted Entities
`subscription-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_subscription_agreement_bert_en_1.0.0_3.0_1669372100379.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_subscription_agreement_bert_en_1.0.0_3.0_1669372100379.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
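The usage example is missing from this card. The sketch below follows the standard Legal NLP document-classification pattern and assumes a licensed John Snow Labs environment with an active `spark` session; the sentence-embeddings model name is an assumption based on sibling cards.

```python
from johnsnowlabs import nlp, legal  # licensed environment (assumed)

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Generic cased BERT sentence embeddings (assumed model name)
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_subscription_agreement_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])
data = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```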
## Results
```bash
+-------+
|result|
+-------+
|[subscription-agreement]|
|[other]|
|[other]|
|[subscription-agreement]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_subscription_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Legal documents, scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
other 0.94 0.97 0.95 65
subscription-agreement 0.94 0.89 0.91 35
accuracy - - 0.94 100
macro-avg 0.94 0.93 0.93 100
weighted-avg 0.94 0.94 0.94 100
```
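In the benchmark above, the macro average is the unweighted mean of the per-class scores, while the weighted average weights each class by its support; reproducing the F1 rows:

```python
def macro_avg(scores):
    """Unweighted mean of per-class scores."""
    return sum(scores) / len(scores)

def weighted_avg(scores, supports):
    """Mean of per-class scores weighted by class support."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

f1_other, f1_sub = 0.95, 0.91   # per-class F1 from the table
n_other, n_sub = 65, 35         # per-class support

print(round(macro_avg([f1_other, f1_sub]), 2))                       # 0.93
print(round(weighted_avg([f1_other, f1_sub], [n_other, n_sub]), 2))  # 0.94
```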
---
layout: model
title: Legal No solicitation Clause Binary Classifier
author: John Snow Labs
name: legclf_no_solicitation_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `no-solicitation` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into account that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, giving you a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `no-solicitation`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_solicitation_clause_en_1.0.0_3.2_1660123765936.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_solicitation_clause_en_1.0.0_3.2_1660123765936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
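No usage snippet is provided for this card. A minimal sketch, mirroring the embeddings-plus-classifier pattern of the surrounding cards, is shown below; the `sent_bert_base_cased` embeddings name and the `ClassifierDLModel` class are assumptions, and the output column follows this card's Output Labels (`category`).

```python
# Hedged sketch (assumed embeddings model and classifier class noted above).
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = ClassifierDLModel.pretrained("legclf_no_solicitation_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```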
## Results
```bash
+-----------------+
|result           |
+-----------------+
|[no-solicitation]|
|[other]          |
|[other]          |
|[no-solicitation]|
+-----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_no_solicitation_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
no-solicitation 0.93 0.96 0.94 26
other 0.98 0.96 0.97 46
accuracy - - 0.96 72
macro-avg 0.95 0.96 0.96 72
weighted-avg 0.96 0.96 0.96 72
```
---
layout: model
title: Detect entities related to road traffic
author: John Snow Labs
name: ner_traffic
date: 2021-04-01
tags: [ner, clinical, licensed, de]
task: Named Entity Recognition
language: de
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Detect entities related to road traffic using pretrained NER model.
## Predicted Entities
`ORGANIZATION_COMPANY`, `DISASTER_TYPE`, `TIME`, `TRIGGER`, `DATE`, `PERSON`, `LOCATION_STOP`, `ORGANIZATION`, `DISTANCE`, `LOCATION_STREET`, `NUMBER`, `DURATION`, `ORG_POSITION`, `LOCATION_ROUTE`, `LOCATION`, `LOCATION_CITY`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_TRAFFIC_DE/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_traffic_de_3.0.0_3.0_1617260858901.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_traffic_de_3.0.0_3.0_1617260858901.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings_german = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_traffic", "de", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_german, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_german = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_traffic", "de", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_german, ner, ner_converter))
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.med_ner.traffic").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_traffic|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|de|
## Benchmarking
```bash
entity tp fp fn total precision recall f1
DURATION 113.0 34.0 94.0 207.0 0.7687 0.5459 0.6384
ORGANIZATION_COMPANY 667.0 324.0 515.0 1182.0 0.6731 0.5643 0.6139
LOCATION_CITY 441.0 137.0 166.0 607.0 0.763 0.7265 0.7443
LOCATION_ROUTE 132.0 30.0 61.0 193.0 0.8148 0.6839 0.7437
DATE 730.0 81.0 168.0 898.0 0.9001 0.8129 0.8543
PERSON 422.0 84.0 174.0 596.0 0.834 0.7081 0.7659
LOCATION_STREET 132.0 12.0 99.0 231.0 0.9167 0.5714 0.704
LOCATION 697.0 94.0 359.0 1056.0 0.8812 0.66 0.7547
TIME 266.0 34.0 45.0 311.0 0.8867 0.8553 0.8707
TRIGGER 187.0 34.0 192.0 379.0 0.8462 0.4934 0.6233
DISTANCE 99.0 0.0 16.0 115.0 1.0 0.8609 0.9252
NUMBER 608.0 147.0 189.0 797.0 0.8053 0.7629 0.7835
LOCATION_STOP 403.0 53.0 77.0 480.0 0.8838 0.8396 0.8611
macro - - - - - - 0.6528
micro - - - - - - 0.7261
```
---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC4CHEMD_Modified_pubmed_clinical
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Modified_pubmed_clinical` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities
`Chemical`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Modified_pubmed_clinical_en_4.0.0_3.0_1657108848432.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Modified_pubmed_clinical_en_4.0.0_3.0_1657108848432.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Modified_pubmed_clinical","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Modified_pubmed_clinical","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_BC4CHEMD_Modified_pubmed_clinical|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|407.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ghadeermobasher/BC4CHEMD-Modified_pubmed_clinical
---
layout: model
title: Turkish BertForTokenClassification Cased model (from busecarik)
author: John Snow Labs
name: bert_token_classifier_loodos_sunlp_ner_turkish
date: 2022-11-30
tags: [tr, open_source, bert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: tr
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-loodos-sunlp-ner-turkish` is a Turkish model originally trained by `busecarik`.
## Predicted Entities
`PRODUCT`, `TIME`, `MONEY`, `ORGANIZATION`, `LOCATION`, `TVSHOW`, `PERSON`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_loodos_sunlp_ner_turkish_tr_4.2.4_3.0_1669815349144.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_loodos_sunlp_ner_turkish_tr_4.2.4_3.0_1669815349144.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_loodos_sunlp_ner_turkish","tr") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_loodos_sunlp_ner_turkish","tr")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_loodos_sunlp_ner_turkish|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|tr|
|Size:|412.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/busecarik/bert-loodos-sunlp-ner-turkish
- https://github.com/SU-NLP/SUNLP-Twitter-NER-Dataset
---
layout: model
title: Legal Transactions With Affiliates Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_transactions_with_affiliates_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, transactions_with_affiliates, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Transactions_With_Affiliates` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at the sentence level.
If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Transactions_With_Affiliates`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_transactions_with_affiliates_bert_en_1.0.0_3.0_1678050557126.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_transactions_with_affiliates_bert_en_1.0.0_3.0_1678050557126.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
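This card's usage section is empty. Below is a minimal sketch following the same embeddings-plus-classifier layout as the other clause classifiers in this collection; the `sent_bert_base_cased` embeddings name and the `ClassifierDLModel` class are assumptions to be checked against the model's actual dependencies.

```python
# Hedged sketch (assumed embeddings model and classifier class noted above).
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = ClassifierDLModel.pretrained("legclf_transactions_with_affiliates_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```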
## Results
```bash
+------------------------------+
|result                        |
+------------------------------+
|[Transactions_With_Affiliates]|
|[Other]                       |
|[Other]                       |
|[Transactions_With_Affiliates]|
+------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_transactions_with_affiliates_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 1.00 0.97 0.98 61
Transactions_With_Affiliates 0.95 1.00 0.98 42
accuracy - - 0.98 103
macro-avg 0.98 0.98 0.98 103
weighted-avg 0.98 0.98 0.98 103
```
---
layout: model
title: Extract Temporal Entities from Voice of the Patient Documents (embeddings_clinical_medium)
author: John Snow Labs
name: ner_vop_temporal_emb_clinical_medium
date: 2023-06-06
tags: [licensed, clinical, ner, en, vop, temporal]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts temporal references from documents written in the patient's own words.
## Predicted Entities
`DateTime`, `Frequency`, `Duration`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_temporal_emb_clinical_medium_en_4.4.3_3.0_1686076464979.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_temporal_emb_clinical_medium_en_4.4.3_3.0_1686076464979.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_vop_temporal_emb_clinical_medium", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["I broke my arm playing football last month and had to get surgery in the orthopedic department. The cast just came off yesterday and I'm excited to start physical therapy and get back to the game."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_vop_temporal_emb_clinical_medium", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("I broke my arm playing football last month and had to get surgery in the orthopedic department. The cast just came off yesterday and I'm excited to start physical therapy and get back to the game.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| chunk | ner_label |
|:-----------|:------------|
| last month | DateTime |
| yesterday | DateTime |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_temporal_emb_clinical_medium|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.8 MB|
|Dependencies:|embeddings_clinical_medium|
## References
In-house annotated health-related text in colloquial language.
## Sample text from the training dataset
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
## Benchmarking
```bash
label tp fp fn total precision recall f1
DateTime 3954 470 448 4402 0.89 0.90 0.90
Frequency 921 190 158 1079 0.83 0.85 0.84
Duration 1952 362 358 2310 0.84 0.85 0.84
macro_avg 6827 1022 964 7791 0.85 0.87 0.86
micro_avg 6827 1022 964 7791 0.87 0.88 0.87
```
---
layout: model
title: Arabic Bert Embeddings (Base)
author: John Snow Labs
name: bert_embeddings_bert_base_arabic
date: 2022-04-11
tags: [bert, embeddings, ar, open_source]
task: Embeddings
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic` is an Arabic model originally trained by `asafaya`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabic_ar_3.4.2_3.0_1649677068712.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabic_ar_3.4.2_3.0_1649677068712.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabic","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabic","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("أنا أحب شرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.embed.bert_base_arabic").predict("""أنا أحب شرارة NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_arabic|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|414.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/asafaya/bert-base-arabic
- https://traces1.inria.fr/oscar/
- http://commoncrawl.org/
- https://dumps.wikimedia.org/backup-index.html
- https://github.com/google-research/bert
- https://www.tensorflow.org/tfrc
- https://github.com/alisafaya/Arabic-BERT
---
layout: model
title: Tagalog Electra Embeddings (from jcblaise)
author: John Snow Labs
name: electra_embeddings_electra_tagalog_small_cased_generator
date: 2022-05-17
tags: [tl, open_source, electra, embeddings]
task: Embeddings
language: tl
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-tagalog-small-cased-generator` is a Tagalog model originally trained by `jcblaise`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_tagalog_small_cased_generator_tl_3.4.4_3.0_1652786760112.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_tagalog_small_cased_generator_tl_3.4.4_3.0_1652786760112.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_tagalog_small_cased_generator","tl") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Mahilig ako sa Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_tagalog_small_cased_generator","tl")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Mahilig ako sa Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_embeddings_electra_tagalog_small_cased_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|tl|
|Size:|18.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/jcblaise/electra-tagalog-small-cased-generator
- https://blaisecruz.com
---
layout: model
title: Fast Neural Machine Translation Model from English to Malayo-Polynesian Languages
author: John Snow Labs
name: opus_mt_en_poz
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, poz, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `poz`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_poz_xx_2.7.0_2.4_1609168000682.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_poz_xx_2.7.0_2.4_1609168000682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_en_poz", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Put your text here.")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_poz", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Put your text here.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.poz').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_poz|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Reporting Clause Binary Classifier
author: John Snow Labs
name: legclf_reporting_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `reporting` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at the sentence level.
If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `reporting`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_reporting_clause_en_1.0.0_3.2_1660123939267.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_reporting_clause_en_1.0.0_3.2_1660123939267.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
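This card ships without a usage snippet; below is a minimal, hypothetical sketch following the pattern of the other classifier cards on this hub. It requires Spark NLP for Legal and a running Spark session, so it is not runnable standalone. The model name, repository (`legal/models`), and the Input/Output Labels (`sentence_embeddings`, `category`) come from the Model Information table below; the embeddings stage shown (UniversalSentenceEncoder) is an assumption and must match the embeddings the model was trained with.

```python
# Hypothetical pipeline sketch -- requires Spark NLP for Legal and a Spark session.
documentAssembler = DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

# Assumption: a sentence-embeddings stage producing the `sentence_embeddings` column.
embeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_reporting_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("clause_text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```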
## Results
```bash
+-----------+
|result     |
+-----------+
|[reporting]|
|[other]    |
|[other]    |
|[reporting]|
+-----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_reporting_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.2 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.93 0.93 0.93 154
reporting 0.86 0.87 0.87 78
accuracy - - 0.91 232
macro-avg 0.90 0.90 0.90 232
weighted-avg 0.91 0.91 0.91 232
```
---
layout: model
title: English RobertaForQuestionAnswering (from deepset)
author: John Snow Labs
name: roberta_qa_roberta_base_squad2_covid
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-covid` is an English model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_covid_en_4.0.0_3.0_1655735153729.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_covid_en_4.0.0_3.0_1655735153729.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad2_covid","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_squad2_covid","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_covid.roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_squad2_covid|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/deepset/roberta-base-squad2-covid
- https://www.linkedin.com/company/deepset-ai/
- https://github.com/deepset-ai/COVID-QA/blob/master/data/question-answering/200423_covidQA.json
- https://haystack.deepset.ai/community/join
- https://deepset.ai/german-bert
- https://github.com/deepset-ai/FARM
- http://www.deepset.ai/jobs
- https://twitter.com/deepset_ai
- https://github.com/deepset-ai/haystack/discussions
- https://github.com/deepset-ai/haystack/
- https://deepset.ai
- https://deepset.ai/germanquad
- https://github.com/deepset-ai/FARM/blob/master/examples/question_answering_crossvalidation.py
---
layout: model
title: ChunkResolver Loinc Clinical
author: John Snow Labs
name: chunkresolve_loinc_clinical
date: 2021-04-02
tags: [entity_resolution, clinical, licensed, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Entity Resolution model based on KNN, using Word Embeddings + Word Movers Distance.
## Predicted Entities
LOINC codes, resolved using `clinical_embeddings`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_LOINC/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_loinc_clinical_en_3.0.0_3.0_1617355407030.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_loinc_clinical_en_3.0.0_3.0_1617355407030.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
loinc_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_loinc_clinical", "en", "clinical/models") \
.setInputCols(["token", "chunk_embeddings"]) \
.setOutputCol("loinc_code") \
.setDistanceFunction("COSINE") \
.setNeighbours(5)
pipeline_loinc = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, loinc_resolver])
data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""]]).toDF("text")
model = pipeline_loinc.fit(data)
results = model.transform(data)
```
```scala
...
val loinc_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_loinc_clinical", "en", "clinical/models")
.setInputCols(Array("token", "chunk_embeddings"))
.setOutputCol("loinc_code")
.setDistanceFunction("COSINE")
.setNeighbours(5)
val pipeline_loinc = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, loinc_resolver))
val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.").toDF("text")
val result = pipeline_loinc.fit(data).transform(data)
```
## Results
```bash
Chunk loinc-Code
0 gestational diabetes mellitus 44877-9
1 type two diabetes mellitus 44877-9
2 T2DM 93692-2
3 prior episode of HTG-induced pancreatitis 85695-5
4 associated with an acute hepatitis 24363-4
5 obesity with a body mass index 47278-7
6 BMI) of 33.5 kg/m2 47214-2
7 polyuria 35234-4
8 polydipsia 25541-4
9 poor appetite 50056-1
10 vomiting 34175-0
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|chunkresolve_loinc_clinical|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[token, chunk_embeddings]|
|Output Labels:|[loinc]|
|Language:|en|
---
layout: model
title: English image_classifier_vit_housing_categories ViTForImageClassification from Albe
author: John Snow Labs
name: image_classifier_vit_housing_categories
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_housing_categories` is an English model originally trained by Albe.
## Predicted Entities
`tree house`, `yurt`, `caravan`, `farm`, `castle`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_housing_categories_en_4.1.0_3.0_1660166778182.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_housing_categories_en_4.1.0_3.0_1660166778182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_housing_categories", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_housing_categories", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_housing_categories|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Pipeline to Detect Clinical Entities
author: John Snow Labs
name: ner_jsl_biobert_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_jsl_biobert](https://nlp.johnsnowlabs.com/2021/09/05/ner_jsl_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_biobert_pipeline_en_3.4.1_3.0_1647869212989.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_biobert_pipeline_en_3.4.1_3.0_1647869212989.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_jsl_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.")
```
```scala
val pipeline = new PretrainedPipeline("ner_jsl_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.jsl_biobert.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
## Results
```bash
+-----------------------------------------+----------------------------+
|chunk |ner_label |
+-----------------------------------------+----------------------------+
|21-day-old |Age |
|Caucasian |Race_Ethnicity |
|male |Gender |
|for 2 days |Duration |
|congestion |Symptom |
|mom |Gender |
|suctioning |Modifier |
|yellow discharge |Symptom |
|nares |External_body_part_or_region|
|she |Gender |
|mild |Modifier |
|problems with his breathing while feeding|Symptom |
|perioral cyanosis |Symptom |
|retractions |Symptom |
|One day ago |RelativeDate |
|mom |Gender |
|tactile temperature |Symptom |
|Tylenol |Drug_BrandName |
|Baby |Age |
|decreased p.o |Symptom |
|His |Gender |
|from 20 minutes q.2h. to 5 to 10 minutes |Duration |
|his |Gender |
|respiratory congestion |Symptom |
|He |Gender |
|tired |Symptom |
|fussy |Symptom |
|over the past 2 days |RelativeDate |
|albuterol |Drug_Ingredient |
|ER |Clinical_Dept |
|His |Gender |
|urine output has also decreased |Symptom |
|he |Gender |
|per 24 hours |Frequency |
|he |Gender |
|per 24 hours |Frequency |
|Mom |Gender |
|diarrhea |Symptom |
|His |Gender |
|bowel |Internal_organ_or_component |
+-----------------------------------------+----------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_jsl_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.7 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverter
---
layout: model
title: Chinese Bert Embeddings (Base, MacBERT)
author: John Snow Labs
name: bert_embeddings_chinese_macbert_base
date: 2022-04-11
tags: [bert, embeddings, zh, open_source]
task: Embeddings
language: zh
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chinese-macbert-base` is a Chinese model originally trained by `hfl`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_macbert_base_zh_3.4.2_3.0_1649669049572.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_macbert_base_zh_3.4.2_3.0_1649669049572.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_macbert_base","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_macbert_base","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.embed.chinese_macbert_base").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_macbert_base|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|384.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/hfl/chinese-macbert-base
- https://github.com/ymcui/MacBERT/blob/master/LICENSE
- https://2020.emnlp.org
- https://arxiv.org/abs/2004.13922
- https://github.com/ymcui/Chinese-BERT-wwm
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/ymcui/HFL-Anthology
- https://github.com/chatopera/Synonyms
---
layout: model
title: Legal Exhibits Clause Binary Classifier
author: John Snow Labs
name: legclf_exhibits_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `exhibits` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting them using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `exhibits`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_exhibits_clause_en_1.0.0_3.2_1660123526366.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_exhibits_clause_en_1.0.0_3.2_1660123526366.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
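This card ships without a usage snippet; below is a minimal, hypothetical sketch following the pattern of the other classifier cards on this hub. It requires Spark NLP for Legal and a running Spark session, so it is not runnable standalone. The model name, repository (`legal/models`), and the Input/Output Labels (`sentence_embeddings`, `category`) come from the Model Information table below; the embeddings stage shown (UniversalSentenceEncoder) is an assumption and must match the embeddings the model was trained with.

```python
# Hypothetical pipeline sketch -- requires Spark NLP for Legal and a Spark session.
documentAssembler = DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

# Assumption: a sentence-embeddings stage producing the `sentence_embeddings` column.
embeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_exhibits_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("clause_text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```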
## Results
```bash
+----------+
|result    |
+----------+
|[exhibits]|
|[other]   |
|[other]   |
|[exhibits]|
+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_exhibits_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
exhibits 0.95 0.97 0.96 59
other 0.98 0.97 0.97 96
accuracy - - 0.97 155
macro-avg 0.96 0.97 0.97 155
weighted-avg 0.97 0.97 0.97 155
```
---
layout: model
title: Vietnamese Deberta Embeddings model (from binhquoc)
author: John Snow Labs
name: deberta_embeddings_vie_small
date: 2023-03-12
tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, vie, tensorflow]
task: Embeddings
language: vie
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DeBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `vie-deberta-small` is a Vietnamese model originally trained by `binhquoc`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_vie_small_vie_4.3.1_3.0_1678626638418.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_vie_small_vie_4.3.1_3.0_1678626638418.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_vie_small","vie") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_vie_small","vie")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|deberta_embeddings_vie_small|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|vie|
|Size:|278.0 MB|
|Case sensitive:|false|
## References
https://huggingface.co/binhquoc/vie-deberta-small
---
layout: model
title: Legal Jurisdictions Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_jurisdictions_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, jurisdictions, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Jurisdictions` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting them using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Jurisdictions`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_jurisdictions_bert_en_1.0.0_3.0_1678050016981.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_jurisdictions_bert_en_1.0.0_3.0_1678050016981.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
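No usage snippet is included on this card; the following is a minimal, hypothetical sketch in the style of the other classifier cards on this hub, and it requires Spark NLP for Legal plus a running Spark session. The model name and the Input/Output Labels (`sentence_embeddings`, `class`) are taken from the Model Information table below; the BERT sentence-embeddings stage and its `sent_bert_base_cased` name are assumptions (the `_bert` suffix in the model name suggests BERT embeddings) and must match the embeddings the model was trained with.

```python
# Hypothetical pipeline sketch -- requires Spark NLP for Legal and a Spark session.
documentAssembler = DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

# Assumption: BERT sentence embeddings feeding the `sentence_embeddings` column.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_jurisdictions_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("clause_text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```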
## Results
```bash
+---------------+
|result         |
+---------------+
|[Jurisdictions]|
|[Other]        |
|[Other]        |
|[Jurisdictions]|
+---------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_jurisdictions_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Jurisdictions 0.86 1.00 0.93 19
Other 1.00 0.91 0.95 32
accuracy - - 0.94 51
macro-avg 0.93 0.95 0.94 51
weighted-avg 0.95 0.94 0.94 51
```
---
layout: model
title: English Bert Embeddings (Uncased)
author: John Snow Labs
name: bert_embeddings_false_positives_scancode_bert_base_uncased_L8_1
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `false-positives-scancode-bert-base-uncased-L8-1` is an English model originally trained by `ayansinha`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_false_positives_scancode_bert_base_uncased_L8_1_en_3.4.2_3.0_1649672624525.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_false_positives_scancode_bert_base_uncased_L8_1_en_3.4.2_3.0_1649672624525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_false_positives_scancode_bert_base_uncased_L8_1","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_false_positives_scancode_bert_base_uncased_L8_1","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.false_positives_scancode_bert_base_uncased_L8_1").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_false_positives_scancode_bert_base_uncased_L8_1|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|410.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/ayansinha/false-positives-scancode-bert-base-uncased-L8-1
- https://github.com/nexB/scancode-results-analyzer
- https://github.com/nexB/scancode-results-analyzer#quickstart---local-machine
- https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py
---
layout: model
title: French CamemBert Embeddings (from ysharma)
author: John Snow Labs
name: camembert_embeddings_ysharma_generic_model_2
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model-2` is a French model originally trained by `ysharma`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_ysharma_generic_model_2_fr_3.4.4_3.0_1653991037086.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_ysharma_generic_model_2_fr_3.4.4_3.0_1653991037086.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_ysharma_generic_model_2","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_ysharma_generic_model_2","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_ysharma_generic_model_2|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/ysharma/dummy-model-2
---
layout: model
title: Multilingual BertForQuestionAnswering model (from mrm8488)
author: John Snow Labs
name: bert_qa_bert_multi_cased_finetuned_xquadv1
date: 2022-06-02
tags: [en, es, de, el, ru, tr, ar, vi, th, zh, hi, open_source, question_answering, bert, xx]
task: Question Answering
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-multi-cased-finetuned-xquadv1` is a Multilingual model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_cased_finetuned_xquadv1_xx_4.0.0_3.0_1654184515717.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_cased_finetuned_xquadv1_xx_4.0.0_3.0_1654184515717.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_multi_cased_finetuned_xquadv1","xx") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_multi_cased_finetuned_xquadv1","xx")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("xx.answer_question.xquad.bert.cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_multi_cased_finetuned_xquadv1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|xx|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1
- https://github.com/google-research/bert/blob/master/multilingual.md
- https://twitter.com/mrm8488
- https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
- https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/Try_mrm8488_xquad_finetuned_model.ipynb
- https://github.com/fxsjy/jieba
- https://github.com/deepmind/xquad
---
layout: model
title: Legal Deposit Of Redemption Price Clause Binary Classifier
author: John Snow Labs
name: legclf_deposit_of_redemption_price_clause
date: 2023-01-27
tags: [en, legal, classification, deposit, redemption, price, clauses, deposit_of_redemption_price, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `deposit-of-redemption-price` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your documents are longer than that, consider splitting them into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`deposit-of-redemption-price`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_deposit_of_redemption_price_clause_en_1.0.0_3.0_1674820606738.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_deposit_of_redemption_price_clause_en_1.0.0_3.0_1674820606738.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
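This card is missing its usage snippet. The following is a minimal sketch of a typical Legal NLP clause-classifier pipeline; it requires a Spark session and the licensed `legal` library, and the `sent_bert_base_cased` embeddings stage and the `category` output column are assumptions based on sibling `legclf` cards, not confirmed for this model.

```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
docClassifier = legal.ClassifierDLModel.pretrained("legclf_deposit_of_redemption_price_clause", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("category")
pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])
df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```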
## Results
```bash
+-----------------------------+
|result                       |
+-----------------------------+
|[deposit-of-redemption-price]|
|[other]                      |
|[other]                      |
|[deposit-of-redemption-price]|
+-----------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_deposit_of_redemption_price_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
deposit-of-redemption-price 0.96 1.00 0.98 22
other 1.00 0.97 0.99 38
accuracy - - 0.98 60
macro-avg 0.98 0.99 0.98 60
weighted-avg 0.98 0.98 0.98 60
```
---
layout: model
title: English BertForMaskedLM Base Cased model (from ayansinha)
author: John Snow Labs
name: bert_embeddings_lic_class_scancode_base_cased_l32_1
date: 2022-12-06
tags: [en, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `lic-class-scancode-bert-base-cased-L32-1` is an English model originally trained by `ayansinha`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_lic_class_scancode_base_cased_l32_1_en_4.2.4_3.0_1670326834348.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_lic_class_scancode_base_cased_l32_1_en_4.2.4_3.0_1670326834348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_lic_class_scancode_base_cased_l32_1","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_lic_class_scancode_base_cased_l32_1","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_lic_class_scancode_base_cased_l32_1|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|406.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/ayansinha/lic-class-scancode-bert-base-cased-L32-1
- https://github.com/nexB/scancode-results-analyzer
- https://github.com/nexB/scancode-results-analyzer#quickstart---local-machine
- https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py
---
layout: model
title: English BertForQuestionAnswering model (from krinal214)
author: John Snow Labs
name: bert_qa_mBERT_all_ty_SQen_SQ20_1
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mBERT_all_ty_SQen_SQ20_1` is an English model originally trained by `krinal214`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mBERT_all_ty_SQen_SQ20_1_en_4.0.0_3.0_1654188214197.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mBERT_all_ty_SQen_SQ20_1_en_4.0.0_3.0_1654188214197.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mBERT_all_ty_SQen_SQ20_1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_mBERT_all_ty_SQen_SQ20_1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.multi_lingual_bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_mBERT_all_ty_SQen_SQ20_1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/krinal214/mBERT_all_ty_SQen_SQ20_1
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from skandaonsolve)
author: John Snow Labs
name: roberta_qa_finetuned_timeentities2
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-finetuned-timeentities2` is an English model originally trained by `skandaonsolve`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_timeentities2_en_4.3.0_3.0_1674220671794.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_timeentities2_en_4.3.0_3.0_1674220671794.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_timeentities2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_timeentities2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_finetuned_timeentities2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|465.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/skandaonsolve/roberta-finetuned-timeentities2
---
layout: model
title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18)
author: John Snow Labs
name: roberta_qa_base_spanish_squades_becasincentivos1
date: 2023-01-20
tags: [es, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: es
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades-becasIncentivos1` is a Spanish model originally trained by `Evelyn18`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasincentivos1_es_4.3.0_3.0_1674217969589.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasincentivos1_es_4.3.0_3.0_1674217969589.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasincentivos1","es")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasincentivos1","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_spanish_squades_becasincentivos1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|es|
|Size:|459.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Evelyn18/roberta-base-spanish-squades-becasIncentivos1
---
layout: model
title: Detect Organism in Medical Texts
author: John Snow Labs
name: bert_token_classifier_ner_linnaeus_species
date: 2022-07-25
tags: [en, ner, clinical, licensed, bertfortokenclassification]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalBertForTokenClassifier
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. The model detects species entities in biomedical texts.
## Predicted Entities
`SPECIES`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_linnaeus_species_en_4.0.0_3.0_1658755473753.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_linnaeus_species_en_4.0.0_3.0_1658755473753.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_linnaeus_species", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("ner")\
.setCaseSensitive(True)\
.setMaxSentenceLength(512)
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
ner_model,
ner_converter
])
data = spark.createDataFrame([["""First identified in chicken, vigilin homologues have now been found in human (6), Xenopus laevis (7), Drosophila melanogaster (8) and Schizosaccharomyces pombe."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_linnaeus_species", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
ner_model,
ner_converter))
val data = Seq("""First identified in chicken, vigilin homologues have now been found in human (6), Xenopus laevis (7), Drosophila melanogaster (8) and Schizosaccharomyces pombe.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.linnaeus_species").predict("""First identified in chicken, vigilin homologues have now been found in human (6), Xenopus laevis (7), Drosophila melanogaster (8) and Schizosaccharomyces pombe.""")
```
## Results
```bash
+-------------------------+-------+
|ner_chunk |label |
+-------------------------+-------+
|chicken |SPECIES|
|human |SPECIES|
|Xenopus laevis |SPECIES|
|Drosophila melanogaster |SPECIES|
|Schizosaccharomyces pombe|SPECIES|
+-------------------------+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_linnaeus_species|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
[https://github.com/cambridgeltl/MTL-Bioinformatics-2016](https://github.com/cambridgeltl/MTL-Bioinformatics-2016)
## Benchmarking
```bash
label precision recall f1-score support
B-SPECIES 0.6391 0.9204 0.7544 1433
I-SPECIES 0.8297 0.7071 0.7635 799
micro-avg 0.6863 0.8441 0.7571 2232
macro-avg 0.7344 0.8138 0.7589 2232
weighted-avg 0.7073 0.8441 0.7576 2232
```
---
layout: model
title: Finnish asr_wav2vec2_xlsr_1b_finnish TFWav2Vec2ForCTC from aapot
author: John Snow Labs
name: pipeline_asr_wav2vec2_xlsr_1b_finnish
date: 2022-09-24
tags: [wav2vec2, fi, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_1b_finnish` is a Finnish model originally trained by aapot.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use `pipeline_asr_wav2vec2_xlsr_1b_finnish_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_1b_finnish_fi_4.2.0_3.0_1664018763154.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_1b_finnish_fi_4.2.0_3.0_1664018763154.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_1b_finnish', lang = 'fi')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_1b_finnish", lang = "fi")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xlsr_1b_finnish|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|fi|
|Size:|3.6 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Icelandic NER Pipeline
author: John Snow Labs
name: roberta_token_classifier_icelandic_ner_pipeline
date: 2022-04-20
tags: [open_source, ner, token_classifier, roberta, icelandic, is]
task: Named Entity Recognition
language: is
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [roberta_token_classifier_icelandic_ner](https://nlp.johnsnowlabs.com/2021/12/06/roberta_token_classifier_icelandic_ner_is.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_icelandic_ner_pipeline_is_3.4.1_3.0_1650453946425.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_icelandic_ner_pipeline_is_3.4.1_3.0_1650453946425.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("roberta_token_classifier_icelandic_ner_pipeline", lang = "is")
pipeline.annotate("Ég heiti Peter Fergusson. Ég hef búið í New York síðan í október 2011 og unnið hjá Tesla Motor og þénað 100K $ á ári.")
```
```scala
val pipeline = new PretrainedPipeline("roberta_token_classifier_icelandic_ner_pipeline", lang = "is")
pipeline.annotate("Ég heiti Peter Fergusson. Ég hef búið í New York síðan í október 2011 og unnið hjá Tesla Motor og þénað 100K $ á ári.")
```
## Results
```bash
+----------------+------------+
|chunk |ner_label |
+----------------+------------+
|Peter Fergusson |Person |
|New York |Location |
|október 2011 |Date |
|Tesla Motor |Organization|
|100K $ |Money |
+----------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_token_classifier_icelandic_ner_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Language:|is|
|Size:|457.5 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- RoBertaForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: Clinical Deidentification (English, Glove, Augmented)
author: John Snow Labs
name: clinical_deidentification_glove_augmented
date: 2022-09-16
tags: [en, deid, deidentification, licensed, clinical, glove, pipeline]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 4.1.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline is trained with lightweight `glove_100d` embeddings and can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR` entities.
It differs from `clinical_deidentification_glove` in how it handles `PHONE` and `PATIENT`: in addition to the NER model, it applies rules from Contextual Parser components.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_augmented_en_4.1.0_3.2_1663311659491.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_augmented_en_4.1.0_3.2_1663311659491.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("clinical_deidentification_glove_augmented", "en", "clinical/models")
deid_pipeline.annotate("""Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN: 324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.""")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val deid_pipeline = PretrainedPipeline("clinical_deidentification_glove_augmented", "en", "clinical/models")
val result = pipeline.annotate("""Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN: 324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.""")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.deid.glove_augmented.pipeline").predict("""Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN: 324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.""")
```
## Results
```bash
{'masked': ['Record date : , , M.D.',
'IP: .',
"The driver's license no: .",
'The SSN: and e-mail: .',
'Name : MR. # Date : .',
'PCP : , years old.',
'Record date : , : .'],
'masked_fixed_length_chars': ['Record date : ****, ****, M.D.',
'IP: ****.',
"The driver's license no: ****.",
'The SSN: **** and e-mail: ****.',
'Name : **** MR. # **** Date : ****.',
'PCP : ****, **** years old.',
'Record date : ****, **** : ****.'],
'masked_with_chars': ['Record date : [********], [********], M.D.',
'IP: [************].',
"The driver's license no: [******].",
'The SSN: [*******] and e-mail: [************].',
'Name : [**************] MR. # [****] Date : [******].',
'PCP : [******], ** years old.',
'Record date : [********], [***********] : [***************].'],
'ner_chunk': ['2093-01-13',
'David Hale',
'A334455B',
'324598674',
'hale@gmail.com',
'Hendrickson, Ora',
'719435',
'01/13/93',
'Oliveira',
'25',
'2079-11-09',
"Patient's VIN",
'1HGBH41JXMN109286'],
'obfuscated': ['Record date : 2093-01-23, Dr Marshia Curling, M.D.',
'IP: 004.004.004.004.',
"The driver's license no: 123XX123.",
'The SSN: SSN-089-89-9294 and e-mail: Mikey@hotmail.com.',
'Name : Stephania Chang MR. # E5881795 Date : 02-14-1983.',
'PCP : Dr Lovella Israel, 52 years old.',
'Record date : 2079-11-14, Dr Colie Carne : 3CCCC22DDDD333888.'],
'sentence': ['Record date : 2093-01-13, David Hale, M.D.',
'IP: 203.120.223.13.',
"The driver's license no: A334455B.",
'The SSN: 324598674 and e-mail: hale@gmail.com.',
'Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93.',
'PCP : Oliveira, 25 years old.',
"Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286."]}
```
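The three masking styles shown above (label masking, fixed-length masks, and same-length masks) can be illustrated with a small standalone sketch in plain Python. This is not Spark NLP code; the entity spans and policy names are assumptions made for illustration only:

```python
def mask(text, entities, policy="fixed_length"):
    """Replace each (start, end) entity span according to a masking policy."""
    out, last = [], 0
    for start, end in entities:
        out.append(text[last:start])
        chunk = text[start:end]
        if policy == "masked":
            # Remove the chunk entirely (the published output shows empty slots).
            out.append("")
        elif policy == "fixed_length":
            # Always four asterisks, regardless of chunk length.
            out.append("****")
        elif policy == "same_length":
            # Preserve the chunk's length; bracket longer chunks.
            out.append("*" * len(chunk) if len(chunk) <= 2
                       else "[" + "*" * (len(chunk) - 2) + "]")
        last = end
    out.append(text[last:])
    return "".join(out)

sentence = "PCP : Oliveira, 25 years old."
spans = [(6, 14), (16, 18)]  # "Oliveira", "25"
print(mask(sentence, spans, "fixed_length"))  # PCP : ****, **** years old.
print(mask(sentence, spans, "same_length"))   # PCP : [******], ** years old.
```

In the actual pipeline these variants are produced by the `DeIdentificationModel` stages listed under Included Models.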
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|clinical_deidentification_glove_augmented|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.1.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|181.3 MB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- ChunkMergeModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ChunkMergeModel
- ChunkMergeModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- Finisher
---
layout: model
title: Multilingual XLMRobertaForTokenClassification Base Cased model (from Davlan)
author: John Snow Labs
name: xlmroberta_ner_base_sadilar
date: 2022-08-01
tags: [xx, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: xx
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-sadilar-ner` is a Multilingual model originally trained by `Davlan`.
## Predicted Entities
`DATE`, `PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_sadilar_xx_4.1.0_3.0_1659356675311.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_sadilar_xx_4.1.0_3.0_1659356675311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_sadilar","xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_sadilar","xx")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_sadilar|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|xx|
|Size:|806.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Davlan/xlm-roberta-base-sadilar-ner
- https://www.sadilar.org/index.php/en/
---
layout: model
title: Company Name to IRS (Edgar database)
author: John Snow Labs
name: legel_edgar_irs
date: 2022-08-30
tags: [en, legal, companies, edgar, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is an Entity Linking / Entity Resolution model that retrieves the IRS number of a company given its name, using the SEC Edgar database.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/ER_EDGAR_CRUNCHBASE/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legel_edgar_irs_en_1.0.0_3.2_1661866500067.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legel_edgar_irs_en_1.0.0_3.2_1661866500067.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+--------------+-----------+---------------------------------------------------------+--------------------------------------------------------+-------------------------------------------+
| chunk| code | all_codes| resolutions | all_distances|
+--------------+-----------+---------------------------------------------------------+--------------------------------------------------------+-------------------------------------------+
| CONTACT GOLD | 981369960| [981369960, 271989147, 208531222, 273566922, 270348508] |[981369960, 271989147, 208531222, 273566922, 270348508] | [0.1733, 0.3700, 0.3867, 0.4103, 0.4121] |
+--------------+-----------+---------------------------------------------------------+--------------------------------------------------------+-------------------------------------------+
```
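The `all_distances` column above holds cosine distances between the input chunk's sentence embedding and the candidate database entries. A standalone sketch of that nearest-neighbour lookup in plain Python; the 3-dimensional vectors are made-up values for illustration, while the real resolver works over high-dimensional sentence embeddings:

```python
import math

# Hypothetical embeddings keyed by IRS code (illustrative values only).
index = {
    "981369960": [0.9, 0.1, 0.2],
    "271989147": [0.4, 0.8, 0.1],
    "208531222": [0.2, 0.7, 0.6],
}

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def resolve(query_vec, index, k=3):
    """Return the k nearest IRS codes, sorted by cosine distance."""
    ranked = sorted(index.items(),
                    key=lambda kv: cosine_distance(query_vec, kv[1]))
    return [(code, round(cosine_distance(query_vec, vec), 4))
            for code, vec in ranked[:k]]

print(resolve([0.85, 0.15, 0.25], index))
```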
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legel_edgar_irs|
|Type:|legal|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[company_irs_number]|
|Language:|en|
|Size:|313.8 MB|
|Case sensitive:|false|
## References
In-house scraping and post-processing of the SEC Edgar database
---
layout: model
title: Translate Yapese to English Pipeline
author: John Snow Labs
name: translate_yap_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, yap, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `yap`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_yap_en_xx_2.7.0_2.4_1609686209390.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_yap_en_xx_2.7.0_2.4_1609686209390.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_yap_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_yap_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.yap.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_yap_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English image_classifier_vit_vliegmachine ViTForImageClassification from johnnydevriese
author: John Snow Labs
name: image_classifier_vit_vliegmachine
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_vliegmachine` is an English model originally trained by johnnydevriese.
## Predicted Entities
`f117`, `f16`, `f18`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vliegmachine_en_4.1.0_3.0_1660166009742.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vliegmachine_en_4.1.0_3.0_1660166009742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_vliegmachine", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_vliegmachine", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_vliegmachine|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: English RobertaForMaskedLM Large Cased model
author: John Snow Labs
name: roberta_embeddings_large
date: 2022-12-12
tags: [en, open_source, roberta_embeddings, robertaformaskedlm]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large` is an English model originally trained by HuggingFace.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_large_en_4.2.4_3.0_1670859597088.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_large_en_4.2.4_3.0_1670859597088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_large","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_large","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_large|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|847.5 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/roberta-large
- https://arxiv.org/abs/1907.11692
- https://github.com/pytorch/fairseq/tree/master/examples/roberta
- https://yknzhu.wixsite.com/mbweb
- https://en.wikipedia.org/wiki/English_Wikipedia
- https://commoncrawl.org/2016/10/news-dataset-available/
- https://github.com/jcpeterson/openwebtext
- https://arxiv.org/abs/1806.02847
---
layout: model
title: Pipeline to Detect Anatomical References (biobert)
author: John Snow Labs
name: ner_anatomy_biobert_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_anatomy_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_anatomy_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_biobert_pipeline_en_3.4.1_3.0_1647873806641.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_biobert_pipeline_en_3.4.1_3.0_1647873806641.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_anatomy_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.")
```
```scala
val pipeline = new PretrainedPipeline("ner_anatomy_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.anatomy_biobert.pipeline").predict("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""")
```
## Results
```bash
+-------------------+----------------------+
|chunk |ner_label |
+-------------------+----------------------+
|right |Organism_subdivision |
|great |Organism_subdivision |
|toe |Organism_subdivision |
|skin |Organ |
|Sclerae |Pathological_formation|
|Extraocular muscles|Multi-tissue_structure|
|Nares |Organ |
|turbinates |Multi-tissue_structure|
|Mucous membranes |Cell |
|Abdomen |Organism_subdivision |
|bowel |Organism_subdivision |
|right |Organism_subdivision |
|toe |Organism_subdivision |
|skin |Organ |
|toenails |Organism_subdivision |
|foot |Organism_subdivision |
|toe |Organism_subdivision |
|toenails |Organism_subdivision |
+-------------------+----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_anatomy_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.1 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverter
---
layout: model
title: Translate English to Slavic languages Pipeline
author: John Snow Labs
name: translate_en_sla
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, sla, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `sla`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_sla_xx_2.7.0_2.4_1609687363449.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_sla_xx_2.7.0_2.4_1609687363449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_sla", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_sla", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.sla').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_sla|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Environmental matters Clause Binary Classifier
author: John Snow Labs
name: legclf_environmental_matters_clause
date: 2022-09-28
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `environmental-matters` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at the sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into account that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
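The paragraph splitting recommended above can be sketched in plain Python. This is an illustrative approximation, not the workshop's exact code; the token budget and whitespace tokenization are simplifying assumptions:

```python
def split_document(text, max_tokens=512):
    """Split on blank lines, then pack paragraphs into chunks of at most
    max_tokens whitespace tokens, so each chunk fits the embedding limit."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = ("Environmental Matters. The Company complies with all laws.\n\n"
       "Miscellaneous. Headings are for convenience only.")
# With a tiny budget the two paragraphs land in separate chunks.
print(split_document(doc, max_tokens=8))
```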
## Predicted Entities
`other`, `environmental-matters`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_environmental_matters_clause_en_1.0.0_3.0_1664363148554.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_environmental_matters_clause_en_1.0.0_3.0_1664363148554.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-----------------------+
|result                 |
+-----------------------+
|[environmental-matters]|
|[other]                |
|[other]                |
|[environmental-matters]|
+-----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_environmental_matters_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
                label  precision  recall  f1-score  support
environmental-matters       0.95    0.86      0.90       21
                other       0.94    0.98      0.96       48
             accuracy          -       -      0.94       69
            macro-avg       0.94    0.92      0.93       69
         weighted-avg       0.94    0.94      0.94       69
```
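As a sanity check, the macro and weighted averages in the benchmarking table follow directly from the per-label scores; a quick standalone computation in plain Python:

```python
# Per-label F1 scores and supports from the benchmarking table above.
scores = {"environmental-matters": (0.90, 21), "other": (0.96, 48)}

f1s = [f1 for f1, _ in scores.values()]
supports = [n for _, n in scores.values()]

# Macro average: unweighted mean over labels.
macro_f1 = sum(f1s) / len(f1s)

# Weighted average: mean weighted by each label's support.
weighted_f1 = sum(f1 * n for f1, n in scores.values()) / sum(supports)

print(round(macro_f1, 2))     # 0.93
print(round(weighted_f1, 2))  # 0.94
```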
---
layout: model
title: ALBERT Embeddings (Base Uncased)
author: John Snow Labs
name: albert_base_uncased
date: 2020-04-28
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [embeddings, en, open_source]
supported: true
annotator: AlBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
ALBERT is "A Lite" version of BERT, a popular unsupervised language representation learning algorithm. ALBERT uses parameter-reduction techniques that allow for large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation. The details are described in the paper "[ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.](https://arxiv.org/abs/1909.11942)"
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_base_uncased_en_2.5.0_2.4_1588073363475.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_base_uncased_en_2.5.0_2.4_1588073363475.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = AlbertEmbeddings.pretrained("albert_base_uncased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = AlbertEmbeddings.pretrained("albert_base_uncased", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.albert.base_uncased').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_albert_base_uncased_embeddings
I [1.0153148174285889, 0.5481745600700378, -0.44...
love [0.3452114760875702, -1.191628336906433, 0.423...
NLP [-0.4268064796924591, -0.3819553852081299, 0.8...
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_base_uncased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.5.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|768|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from [https://tfhub.dev/google/albert_base/3](https://tfhub.dev/google/albert_base/3)
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578 TFWav2Vec2ForCTC from doddle124578
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578` is an English model originally trained by doddle124578.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use `asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578_en_4.2.0_3.0_1664037269433.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578_en_4.2.0_3.0_1664037269433.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|
---
layout: model
title: Detect Anatomical and Observation Entities in Chest Radiology Reports (CheXpert)
author: John Snow Labs
name: ner_chexpert
date: 2021-09-30
tags: [licensed, ner, clinical, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.3.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts `Anatomical` and `Observation` entities from Chest Radiology Reports.
## Predicted Entities
`ANAT - Anatomy`, `OBS - Observation`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_RADIOLOGY/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chexpert_en_3.3.0_3.0_1633010671460.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chexpert_en_3.3.0_3.0_1633010671460.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_chexpert", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax. FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base."]], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_chexpert", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))
val data = Seq("""FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax. FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.chexpert").predict("""FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax. FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base.""")
```
## Results
```bash
| | chunk | label |
|---:|:-------------------------|:--------|
| 0 | endotracheal tube | OBS |
| 1 | Swan - Ganz catheter | OBS |
| 2 | left chest | ANAT |
| 3 | tube | OBS |
| 4 | in place | OBS |
| 5 | pneumothorax | OBS |
| 6 | Mild atelectatic changes | OBS |
| 7 | left base | ANAT |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_chexpert|
|Compatibility:|Healthcare NLP 3.3.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Data Source
Trained on the CheXpert dataset described in [https://arxiv.org/pdf/2106.14463.pdf](https://arxiv.org/pdf/2106.14463.pdf).
## Benchmarking
```bash
label tp fp fn prec rec f1
I-ANAT_DP 26 11 11 0.7027027 0.7027027 0.7027027
B-OBS_DP 1489 141 104 0.9134969 0.9347144 0.9239839
I-OBS_DP 16 3 54 0.84210527 0.22857143 0.35955057
B-ANAT_DP 1125 39 45 0.96649486 0.96153843 0.96401024
Macro-average 2656 194 214 0.8561999 0.70688176 0.7744088
Micro-average 2656 194 214 0.9319298 0.92543554 0.9286713
```
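The precision, recall and F1 columns in the table above follow the standard definitions computed from the tp/fp/fn counts; a quick sanity check against the micro-average row (all numbers taken from the table):

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Micro-average row of the benchmark: tp=2656, fp=194, fn=214
p, r, f1 = prf(2656, 194, 214)
```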
---
layout: model
title: Legal Waivers Clause Binary Classifier
author: John Snow Labs
name: legclf_waivers_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `waivers` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (the same tutorial linked above covers this).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you add.
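The splitting and token-budget steps described above can be sketched in plain Python. This is a minimal sketch: the regex paragraph split implements "splitting by multiline", and the whitespace token count is only a rough stand-in for the model's real subword tokenizer.

```python
import re

MAX_TOKENS = 512  # context budget of the classifier's embeddings, per the note above

def split_paragraphs(text: str) -> list:
    """Paragraph splitting (by multiline): break the document on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def fits_budget(paragraph: str, max_tokens: int = MAX_TOKENS) -> bool:
    # Whitespace tokens only approximate subword counts; real budgets are tighter.
    return len(paragraph.split()) <= max_tokens

doc = (
    "WAIVERS. No failure to exercise any right shall operate as a waiver.\n\n"
    "GOVERNING LAW. This Agreement is governed by the laws of Delaware."
)
chunks = [p for p in split_paragraphs(doc) if fits_budget(p)]
```

Each surviving chunk would then be fed to the classifier as an independent input.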
## Predicted Entities
`other`, `waivers`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_waivers_clause_en_1.0.0_3.2_1660124127062.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_waivers_clause_en_1.0.0_3.2_1660124127062.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+---------+
|   result|
+---------+
|[waivers]|
|  [other]|
|  [other]|
|[waivers]|
+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_waivers_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.2 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.96 0.98 0.97 324
waivers 0.96 0.90 0.93 128
accuracy - - 0.96 452
macro-avg 0.96 0.94 0.95 452
weighted-avg 0.96 0.96 0.96 452
```
---
layout: model
title: Detect PHI for Deidentification (Sub Entity)
author: John Snow Labs
name: ner_deid_subentity
date: 2022-01-06
tags: [de, deid, ner, licensed]
task: Named Entity Recognition
language: de
edition: Healthcare NLP 3.3.4
spark_version: 2.4
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The Named Entity Recognition annotator allows a generic model to be trained by utilizing a deep learning algorithm (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER is a Named Entity Recognition model that annotates German text to find protected health information (PHI) that may need to be deidentified. It was trained with in-house annotations and detects 12 entities.
## Predicted Entities
`PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `STREET`, `USERNAME`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_DE){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_de_3.3.4_2.4_1641460993460.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_de_3.3.4_2.4_1641460993460.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","de","clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
deid_ner = MedicalNerModel.pretrained("ner_deid_subentity", "de", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_deid_subentity_chunk")
nlpPipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
deid_ner,
ner_converter])
data = spark.createDataFrame([["""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus
in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."""]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val deid_ner = MedicalNerModel.pretrained("ner_deid_subentity", "de", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_deid_subentity_chunk")
val nlpPipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
deid_ner,
ner_converter))
val data = Seq("""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhausin Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""").toDS.toDF("text")
val result = nlpPipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.med_ner.deid_subentity").predict("""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus
in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""")
```
## Results
```bash
+-------------------------+-------------------------+
|chunk |ner_deid_subentity_chunk |
+-------------------------+-------------------------+
|Michael Berger |PATIENT |
|12 Dezember 2018 |DATE |
|St. Elisabeth-Krankenhaus|HOSPITAL |
|Bad Kissingen |CITY |
|Berger |PATIENT |
|76 |AGE |
+-------------------------+-------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_subentity|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|15.0 MB|
## Data Source
In-house annotated dataset
## Benchmarking
```bash
label tp fp fn total precision recall f1
PATIENT 2080.0 58.0 74.0 2154.0 0.9729 0.9656 0.9692
HOSPITAL 1598.0 4.0 0.0 1598.0 0.9975 1.0 0.9988
DATE 4047.0 7.0 2.0 4049.0 0.9983 0.9995 0.9989
ORGANIZATION 1288.0 108.0 67.0 1355.0 0.9226 0.9506 0.9364
CITY 196.0 1.0 4.0 200.0 0.9949 0.98 0.9874
STREET 124.0 1.0 4.0 128.0 0.992 0.9688 0.9802
USERNAME 45.0 0.0 0.0 45.0 1.0 1.0 1.0
PROFESSION 262.0 1.0 0.0 262.0 0.9962 1.0 0.9981
PHONE 71.0 10.0 9.0 80.0 0.8765 0.8875 0.882
COUNTRY 306.0 5.0 6.0 312.0 0.9839 0.9808 0.9823
DOCTOR 1414.0 9.0 39.0 1453.0 0.9937 0.9732 0.9833
AGE 473.0 3.0 3.0 476.0 0.9937 0.9937 0.9937
```
---
layout: model
title: English BertForMaskedLM Base Uncased model
author: John Snow Labs
name: bert_embeddings_base_uncased
date: 2022-12-02
tags: [en, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased` is an English model originally trained by HuggingFace.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_uncased_en_4.2.4_3.0_1670019190911.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_uncased_en_4.2.4_3.0_1670019190911.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_uncased","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_uncased","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_uncased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|409.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/bert-base-uncased
- https://arxiv.org/abs/1810.04805
- https://github.com/google-research/bert
- https://github.com/google-research/bert/blob/master/README.md
- https://yknzhu.wixsite.com/mbweb
- https://en.wikipedia.org/wiki/English_Wikipedia
---
layout: model
title: Fast Neural Machine Translation Model from English to Berber
author: John Snow Labs
name: opus_mt_en_ber
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, ber, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `en`
- target languages: `ber`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ber_xx_2.7.0_2.4_1609169805124.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ber_xx_2.7.0_2.4_1609169805124.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_en_ber", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your text to translate here.")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_ber", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your text to translate here.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ber').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ber|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Indemnification Clause Binary Classifier (md)
author: John Snow Labs
name: legclf_indemnification_md
date: 2022-11-25
tags: [en, legal, classification, document, agreement, contract, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `indemnification` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (the same tutorial linked above covers this).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you add.
## Predicted Entities
`other`, `indemnification`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_md_en_1.0.0_3.0_1669376491781.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_md_en_1.0.0_3.0_1669376491781.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-----------------+
|           result|
+-----------------+
|[indemnification]|
|          [other]|
|          [other]|
|[indemnification]|
+-----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_indemnification_md|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
precision recall f1-score support
indemnification-and-contribution 0.84 0.93 0.88 28
other 0.94 0.87 0.91 39
accuracy 0.90 67
macro avg 0.89 0.90 0.89 67
weighted avg 0.90 0.90 0.90 67
```
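The `weighted avg` row above is the support-weighted mean of the per-class scores; a quick check with the per-class numbers taken from the table:

```python
# (precision, recall, f1, support) per class, from the benchmark table above
ROWS = [
    (0.84, 0.93, 0.88, 28),  # indemnification-and-contribution
    (0.94, 0.87, 0.91, 39),  # other
]

def weighted_avg(rows):
    """Support-weighted average of precision, recall and F1."""
    total = sum(s for _, _, _, s in rows)
    wp = sum(p * s for p, _, _, s in rows) / total
    wr = sum(r * s for _, r, _, s in rows) / total
    wf = sum(f * s for _, _, f, s in rows) / total
    return wp, wr, wf

wp, wr, wf = weighted_avg(ROWS)
```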
---
layout: model
title: Recognize Entities OntoNotes pipeline - BERT Small
author: John Snow Labs
name: onto_recognize_entities_bert_small
date: 2021-03-22
tags: [open_source, english, onto_recognize_entities_bert_small, pipeline, en]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: en
nav_key: models
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The onto_recognize_entities_bert_small pipeline is a pretrained pipeline that performs the most common text processing steps on your dataframe: sentence detection, tokenization, small BERT embeddings, and OntoNotes named entity recognition.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_small_en_3.0.0_3.0_1616443983762.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_small_en_3.0.0_3.0_1616443983762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('onto_recognize_entities_bert_small', lang = 'en')
annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("onto_recognize_entities_bert_small", lang = "en")
val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs ! "]
result_df = nlu.load('en.ner.onto.bert.small').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | embeddings | ner | entities |
|---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:----------------------------------------------------|:-------------------|
| 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[0.9379079937934875,.,...]] | ['O', 'O', 'B-PERSON', 'I-PERSON', 'I-PERSON', 'O'] | ['John Snow Labs'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|onto_recognize_entities_bert_small|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Swty)
author: John Snow Labs
name: distilbert_qa_swty_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Swty`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_swty_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769390900.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_swty_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769390900.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_swty_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_swty_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_swty_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Swty/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Pipeline to Mapping RxNORM Codes with Their Corresponding UMLS Codes
author: John Snow Labs
name: rxnorm_umls_mapping
date: 2023-06-13
tags: [en, licensed, clinical, pipeline, chunk_mapping, rxnorm, umls]
task: Chunk Mapping
language: en
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the `rxnorm_umls_mapper` model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_umls_mapping_en_4.4.4_3.2_1686663532119.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_umls_mapping_en_4.4.4_3.2_1686663532119.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("rxnorm_umls_mapping", "en", "clinical/models")
result = pipeline.fullAnnotate("1161611 315677")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("rxnorm_umls_mapping", "en", "clinical/models")
val result = pipeline.fullAnnotate("1161611 315677")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.rxnorm.umls.mapping").predict("""Put your text here.""")
```
## Results
```bash
|    | rxnorm_code | umls_code |
|---:|:------------|:----------|
|  0 | 1161611     | C3215948  |
|  1 | 315677      | C0984912  |
```
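Conceptually, the chunk mapper behaves like a lookup table from RxNorm codes to UMLS codes. A toy sketch using the two mappings shown in the Results above; the dictionary is illustrative only, the real model ships a much larger mapping resource:

```python
# Illustrative lookup table with the two example codes; not the model's actual resource.
RXNORM_TO_UMLS = {
    "1161611": "C3215948",
    "315677": "C0984912",
}

def map_codes(text: str, mapping: dict = RXNORM_TO_UMLS) -> list:
    """Map whitespace-separated RxNorm codes to UMLS codes ('NONE' if unknown)."""
    return [mapping.get(code, "NONE") for code in text.split()]

umls = map_codes("1161611 315677")
```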
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|rxnorm_umls_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.9 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- ChunkMapperModel
---
layout: model
title: Legal Cancellation Clause Binary Classifier
author: John Snow Labs
name: legclf_cancellation_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `cancellation` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `cancellation`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cancellation_clause_en_1.0.0_3.2_1660122195560.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cancellation_clause_en_1.0.0_3.2_1660122195560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
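This card ships without a usage snippet, so here is a minimal sketch of a typical Legal NLP document-classification pipeline for this model. The `sent_bert_base_cased` embeddings stage and the `legal.ClassifierDLModel` loader are assumptions based on similar legclf models; verify them against the model's training details before relying on this example.

```python
# Hypothetical usage sketch; the embeddings model name is an assumption
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence-level BERT embeddings feeding the classifier's sentence_embeddings input
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_cancellation_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
model = pipeline.fit(df)
result = model.transform(df)
```

The classifier emits `cancellation` or `other` in the `category` column, matching the Results section below.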
## Results
```bash
+-------+
| result|
+-------+
|[cancellation]|
|[other]|
|[other]|
|[cancellation]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_cancellation_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
cancellation 0.92 0.92 0.92 38
other 0.97 0.97 0.97 96
accuracy - - 0.96 134
macro-avg 0.94 0.94 0.94 134
weighted-avg 0.96 0.96 0.96 134
```
---
layout: model
title: German asr_wav2vec2_large_xlsr_53_german_by_marcel TFWav2Vec2ForCTC from marcel
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_53_german_by_marcel
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_german_by_marcel` is a German model originally trained by marcel.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_german_by_marcel_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_by_marcel_de_4.2.0_3.0_1664101959541.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_by_marcel_de_4.2.0_3.0_1664101959541.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_german_by_marcel', lang = 'de')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_german_by_marcel", lang = "de")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_german_by_marcel|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|de|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Czech asr_wav2vec2_large_xlsr_53_Czech TFWav2Vec2ForCTC from MehdiHosseiniMoghadam
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_53_Czech
date: 2022-09-25
tags: [wav2vec2, cs, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: cs
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_Czech` is a Czech model originally trained by MehdiHosseiniMoghadam.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_Czech_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_Czech_cs_4.2.0_3.0_1664119968423.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_Czech_cs_4.2.0_3.0_1664119968423.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_Czech', lang = 'cs')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_Czech", lang = "cs")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_Czech|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|cs|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Hausa asr_Hausa_xlsr TFWav2Vec2ForCTC from Akashpb13
author: John Snow Labs
name: asr_Hausa_xlsr
date: 2022-09-26
tags: [wav2vec2, ha, audio, open_source, asr]
task: Automatic Speech Recognition
language: ha
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Hausa_xlsr` is a Hausa model originally trained by Akashpb13.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_Hausa_xlsr_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Hausa_xlsr_ha_4.2.0_3.0_1664192959924.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Hausa_xlsr_ha_4.2.0_3.0_1664192959924.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_Hausa_xlsr", "ha")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_Hausa_xlsr", "ha")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_Hausa_xlsr|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|ha|
|Size:|1.2 GB|
---
layout: model
title: Legal No waiver Clause Binary Classifier
author: John Snow Labs
name: legclf_no_waiver_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `no-waiver` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `no-waiver`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_waiver_clause_en_1.0.0_3.2_1660122721126.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_waiver_clause_en_1.0.0_3.2_1660122721126.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
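This card ships without a usage snippet, so here is a minimal sketch of a typical Legal NLP document-classification pipeline for this model. The `sent_bert_base_cased` embeddings stage and the `legal.ClassifierDLModel` loader are assumptions based on similar legclf models; verify them against the model's training details before relying on this example.

```python
# Hypothetical usage sketch; the embeddings model name is an assumption
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence-level BERT embeddings feeding the classifier's sentence_embeddings input
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_no_waiver_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
model = pipeline.fit(df)
result = model.transform(df)
```

The classifier emits `no-waiver` or `other` in the `category` column, matching the Results section below.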
## Results
```bash
+-------+
| result|
+-------+
|[no-waiver]|
|[other]|
|[other]|
|[no-waiver]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_no_waiver_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
no-waiver 0.95 0.98 0.97 43
other 0.99 0.98 0.99 119
accuracy - - 0.98 162
macro-avg 0.97 0.98 0.98 162
weighted-avg 0.98 0.98 0.98 162
```
---
layout: model
title: Stopwords Remover for Armenian language (105 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, hy, open_source]
task: Stop Words Removal
language: hy
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_hy_3.4.1_3.0_1646672921601.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_hy_3.4.1_3.0_1646672921601.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","hy") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Դու ինձանից ավելի լավն ես"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stopWords = StopWordsCleaner.pretrained("stopwords_iso","hy")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stopWords))
val data = Seq("Դու ինձանից ավելի լավն ես").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("hy.stopwords").predict("""Դու ինձանից ավելի լավն ես""")
```
## Results
```bash
+----------------------+
|result |
+----------------------+
|[ինձանից, ավելի, լավն]|
+----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|hy|
|Size:|1.8 KB|
---
layout: model
title: Fast Neural Machine Translation Model from Central Bikol to Spanish
author: John Snow Labs
name: opus_mt_bcl_es
date: 2021-06-01
tags: [open_source, seq2seq, translation, bcl, es, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
source languages: bcl
target languages: es
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_es_xx_3.1.0_2.4_1622561503404.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_es_xx_3.1.0_2.4_1622561503404.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_bcl_es", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_bcl_es", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Central Bikol.translate_to.Spanish').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_bcl_es|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English Deberta Embeddings model (from domenicrosati)
author: John Snow Labs
name: deberta_embeddings_v3_large_dapt_scientific_papers_pubmed_tapt
date: 2023-03-12
tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, en, tensorflow]
task: Embeddings
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DeBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deberta-v3-large-dapt-scientific-papers-pubmed-tapt` is a English model originally trained by `domenicrosati`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_v3_large_dapt_scientific_papers_pubmed_tapt_en_4.3.1_3.0_1678658548832.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_v3_large_dapt_scientific_papers_pubmed_tapt_en_4.3.1_3.0_1678658548832.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_v3_large_dapt_scientific_papers_pubmed_tapt","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_v3_large_dapt_scientific_papers_pubmed_tapt","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|deberta_embeddings_v3_large_dapt_scientific_papers_pubmed_tapt|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|1.6 GB|
|Case sensitive:|false|
## References
https://huggingface.co/domenicrosati/deberta-v3-large-dapt-scientific-papers-pubmed-tapt
---
layout: model
title: Javanese DistilBERT Embeddings (Small, Wikipedia)
author: John Snow Labs
name: distilbert_embeddings_javanese_distilbert_small
date: 2022-04-12
tags: [distilbert, embeddings, jv, open_source]
task: Embeddings
language: jv
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `javanese-distilbert-small` is a Javanese model originally trained by `w11wo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_javanese_distilbert_small_jv_3.4.2_3.0_1649783759354.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_javanese_distilbert_small_jv_3.4.2_3.0_1649783759354.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_javanese_distilbert_small","jv") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_javanese_distilbert_small","jv")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("jv.embed.distilbert").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_embeddings_javanese_distilbert_small|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|jv|
|Size:|248.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/w11wo/javanese-distilbert-small
- https://arxiv.org/abs/1910.01108
- https://github.com/sgugger
- https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb
- https://w11wo.github.io/
---
layout: model
title: Legal Vacations Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_vacations_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, vacations, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Vacations` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Vacations`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_vacations_bert_en_1.0.0_3.0_1678050711001.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_vacations_bert_en_1.0.0_3.0_1678050711001.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
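This card ships without a usage snippet, so here is a minimal sketch of a typical Legal NLP document-classification pipeline for this model. The `sent_bert_base_cased` embeddings stage and the `legal.ClassifierDLModel` loader are assumptions based on similar legclf models; verify them against the model's training details before relying on this example.

```python
# Hypothetical usage sketch; the embeddings model name is an assumption
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence-level BERT embeddings feeding the classifier's sentence_embeddings input
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_vacations_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
model = pipeline.fit(df)
result = model.transform(df)
```

The classifier emits `Vacations` or `Other` in the `class` column, matching the Results section below.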
## Results
```bash
+-------+
|result|
+-------+
|[Vacations]|
|[Other]|
|[Other]|
|[Vacations]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_vacations_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 1.00 0.98 0.99 56
Vacations 0.97 1.00 0.99 36
accuracy - - 0.99 92
macro-avg 0.99 0.99 0.99 92
weighted-avg 0.99 0.99 0.99 92
```
---
layout: model
title: T5 text-to-text model
author: John Snow Labs
name: t5_small
date: 2020-12-21
task: [Question Answering, Summarization, Translation]
language: en
nav_key: models
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, t5, en]
supported: true
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is the T5 transformer model described in the seminal paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". It can perform a variety of tasks, such as text summarization, question answering and translation. More details about using the model can be found in the paper (https://arxiv.org/pdf/1910.10683.pdf).
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/T5TRANSFORMER/){:.button.button-orange}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5TRANSFORMER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_en_2.7.0_2.4_1608554292913.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_en_2.7.0_2.4_1608554292913.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("documents")
t5 = T5Transformer() \
.pretrained("t5_small") \
.setTask("summarize:")\
.setMaxOutputLength(200)\
.setInputCols(["documents"]) \
.setOutputCol("summaries")
pipeline = Pipeline().setStages([document_assembler, t5])
results = pipeline.fit(data_df).transform(data_df)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("documents")
val t5 = T5Transformer
.pretrained("t5_small")
.setTask("summarize:")
.setInputCols(Array("documents"))
.setOutputCol("summaries")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val result = pipeline.fit(dataDf).transform(dataDf)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.t5.small").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_small|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|en|
## Data Source
C4
---
layout: model
title: German BertForMaskedLM Base Uncased model (from dbmdz)
author: John Snow Labs
name: bert_embeddings_base_german_uncased
date: 2022-12-02
tags: [de, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: de
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-german-uncased` is a German model originally trained by `dbmdz`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_german_uncased_de_4.2.4_3.0_1670017726606.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_german_uncased_de_4.2.4_3.0_1670017726606.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_german_uncased","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_german_uncased","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_german_uncased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|de|
|Size:|412.7 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/dbmdz/bert-base-german-uncased
- https://deepset.ai/german-bert
- https://deepset.ai/
- https://spacy.io/
- https://github.com/allenai/scibert
- https://github.com/stefan-it/fine-tuned-berts-seq
- https://github.com/dbmdz/berts/issues/new
---
layout: model
title: Detect Cellular/Molecular Biology Entities (clinical_large)
author: John Snow Labs
name: ner_cellular_emb_clinical_large
date: 2023-05-24
tags: [ner, en, licensed, clinical, dna, rna, protein, cell_line, cell_type]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for molecular biology related terms. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
## Predicted Entities
`DNA`, `Cell_type`, `Cell_line`, `RNA`, `Protein`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CELLULAR/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_CELLULAR.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cellular_emb_clinical_large_en_4.4.2_3.0_1684920548062.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cellular_emb_clinical_large_en_4.4.2_3.0_1684920548062.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_cellular_emb_clinical_large", "en", "clinical/models")\
.setInputCols(["sentence", "token","embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(['sentence', 'token', 'ner'])\
.setOutputCol('ner_chunk')
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
clinical_embeddings,
ner_model,
ner_converter
])
sample_df = spark.createDataFrame([["""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive."""]]).toDF("text")
result = pipeline.fit(sample_df).transform(sample_df)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_cellular_emb_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
clinical_embeddings,
ner_model,
ner_converter))
val sample_data = Seq("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""").toDS.toDF("text")
val result = pipeline.fit(sample_data).transform(sample_data)
```
## Results
```bash
+-------------------------------------------+-----+---+---------+
|chunk |begin|end|ner_label|
+-------------------------------------------+-----+---+---------+
|intracellular signaling proteins |27 |58 |protein |
|human T-cell leukemia virus type 1 promoter|130 |172|DNA |
|Tax |186 |188|protein |
|Tax-responsive element 1 |193 |216|DNA |
|cyclic AMP-responsive members |237 |265|protein |
|CREB/ATF family |274 |288|protein |
|transcription factors |293 |313|protein |
|Tax |389 |391|protein |
|Tax-responsive element 1 |431 |454|DNA |
|TRE-1 |457 |461|DNA |
|lacZ gene |582 |590|DNA |
|CYC1 promoter |617 |629|DNA |
|TRE-1 |663 |667|DNA |
|cyclic AMP response element-binding protein|695 |737|protein |
|CREB |740 |743|protein |
|CREB |749 |752|protein |
|GAL4 activation domain |767 |788|protein |
|GAD |791 |793|protein |
|reporter gene |848 |860|DNA |
|Tax |863 |865|protein |
+-------------------------------------------+-----+---+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_cellular_emb_clinical_large|
|Compatibility:|Healthcare NLP 4.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|2.8 MB|
## References
Trained on the JNLPBA corpus, which contains 2,404 publication abstracts. (http://www.geniaproject.org/)
## Benchmarking
```bash
label precision recall f1-score support
cell_type 0.89 0.79 0.84 4912
protein 0.80 0.90 0.84 9841
cell_line 0.66 0.75 0.70 1489
DNA 0.78 0.87 0.82 2845
RNA 0.79 0.81 0.80 305
micro-avg 0.80 0.85 0.83 19392
macro-avg 0.78 0.82 0.80 19392
weighted-avg 0.81 0.85 0.83 19392
```
---
layout: model
title: BERT multilingual base model (cased)
author: John Snow Labs
name: bert_base_multilingual_cased
date: 2021-05-20
tags: [xx, multilingual, embeddings, bert, open_source]
supported: true
task: Embeddings
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained model on the top 104 languages with the largest Wikipedia using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in [this repository](https://github.com/google-research/bert). This model is case sensitive: it makes a difference between english and English.
BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it
was pretrained with two objectives:
- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
- Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict if the two sentences were following each other or not. This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the BERT model as inputs.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_multilingual_cased_xx_3.1.0_2.4_1621519556711.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_multilingual_cased_xx_3.1.0_2.4_1621519556711.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
embeddings = BertEmbeddings.pretrained("bert_base_multilingual_cased", "xx") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
```
```scala
val embeddings = BertEmbeddings.pretrained("bert_base_multilingual_cased", "xx")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
```
{:.nlu-block}
```python
import nlu
nlu.load("xx.embed.bert_base_multilingual_cased").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_base_multilingual_cased|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[embeddings]|
|Language:|xx|
|Case sensitive:|true|
## Data Source
[https://huggingface.co/bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
---
layout: model
title: Extract Oncology Tests
author: John Snow Labs
name: ner_oncology_test
date: 2022-11-24
tags: [licensed, clinical, oncology, en, ner, test]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts mentions of tests from oncology texts, including pathology tests and imaging tests.
Definitions of Predicted Entities:
- `Biomarker`: Biological molecules that indicate the presence or absence of cancer, or the type of cancer. Oncogenes are excluded from this category.
- `Biomarker_Result`: Terms or values that are identified as the result of a biomarker.
- `Imaging_Test`: Imaging tests mentioned in texts, such as "chest CT scan".
- `Oncogene`: Mentions of genes that are implicated in the etiology of cancer.
- `Pathology_Test`: Mentions of biopsies or tests that use tissue samples.
## Predicted Entities
`Biomarker`, `Biomarker_Result`, `Imaging_Test`, `Oncogene`, `Pathology_Test`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_test_en_4.2.2_3.0_1669307746859.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_test_en_4.2.2_3.0_1669307746859.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_test", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["A biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_test", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("A biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_test").predict("""A biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative.""")
```
## Results
```bash
| chunk | ner_label |
|:-------------------------------|:---------------|
| biopsy | Pathology_Test |
| ultrasound guided thick-needle | Pathology_Test |
| chest computed tomography | Imaging_Test |
| CT | Imaging_Test |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_test|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|34.2 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Imaging_Test 2020 229 184 2204 0.90 0.92 0.91
Biomarker_Result 1177 186 268 1445 0.86 0.81 0.84
Pathology_Test 888 276 162 1050 0.76 0.85 0.80
Biomarker 1287 254 228 1515 0.84 0.85 0.84
Oncogene 365 89 84 449 0.80 0.81 0.81
macro_avg 5737 1034 926 6663 0.83 0.85 0.84
micro_avg 5737 1034 926 6663 0.85 0.86 0.85
```
---
layout: model
title: English BertForQuestionAnswering model (from deepset)
author: John Snow Labs
name: bert_base_cased_qa_squad2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-squad2` is an English model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_cased_qa_squad2_en_4.0.0_3.0_1654193845988.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_cased_qa_squad2_en_4.0.0_3.0_1654193845988.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_base_cased_qa_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_base_cased_qa_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.bert.base_cased.by_deepset").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_base_cased_qa_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/deepset/bert-base-cased-squad2
---
layout: model
title: Legal Specific performance Clause Binary Classifier
author: John Snow Labs
name: legclf_specific_performance_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `specific-performance` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend that you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
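As a rough illustration of the first splitting technique above (paragraph splitting by multiline), here is a plain-Python sketch; the helper name is illustrative and not part of any John Snow Labs API:

```python
import re

def split_paragraphs(text):
    # Split on one or more blank lines; drop empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "1. DEFINITIONS.\nTerms used herein...\n\n2. SPECIFIC PERFORMANCE.\nThe parties agree..."
paragraphs = split_paragraphs(doc)
# Each paragraph can then be classified independently by this model,
# keeping every input under the 512-token embedding limit.
```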
## Predicted Entities
`other`, `specific-performance`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_specific_performance_clause_en_1.0.0_3.2_1660123020947.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_specific_performance_clause_en_1.0.0_3.2_1660123020947.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
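This card omits a usage snippet. Below is a minimal sketch mirroring the other classifier pipelines in this listing; the `sent_bert_base_cased` embeddings name is an assumption, not confirmed by this card — substitute the sentence-embeddings model this classifier was trained with (its input column is `sentence_embeddings`, per the Model Information table below).

```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
# NOTE: this embeddings model is an assumption; use the one the classifier was trained with.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
doc_classifier = ClassifierDLModel.pretrained("legclf_specific_performance_clause", "en", "legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")
pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])
data = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```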
## Results
```bash
+----------------------+
|result                |
+----------------------+
|[specific-performance]|
|[other]               |
|[other]               |
|[specific-performance]|
+----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_specific_performance_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 1.00 0.98 0.99 94
specific-performance 0.95 1.00 0.97 36
accuracy - - 0.98 130
macro-avg 0.97 0.99 0.98 130
weighted-avg 0.99 0.98 0.98 130
```
---
layout: model
title: Legal Counterparts Clause Binary Classifier
author: John Snow Labs
name: legclf_counterparts_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `counterparts` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend that you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `counterparts`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_counterparts_clause_en_1.0.0_3.2_1660123374262.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_counterparts_clause_en_1.0.0_3.2_1660123374262.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
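This card omits a usage snippet. Below is a minimal sketch mirroring the other classifier pipelines in this listing; the `sent_bert_base_cased` embeddings name is an assumption, not confirmed by this card — substitute the sentence-embeddings model this classifier was trained with (its input column is `sentence_embeddings`, per the Model Information table below).

```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
# NOTE: this embeddings model is an assumption; use the one the classifier was trained with.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
doc_classifier = ClassifierDLModel.pretrained("legclf_counterparts_clause", "en", "legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")
pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])
data = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```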
## Results
```bash
+--------------+
|result        |
+--------------+
|[counterparts]|
|[other]       |
|[other]       |
|[counterparts]|
+--------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_counterparts_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
counterparts 1.00 1.00 1.00 38
other 1.00 1.00 1.00 97
accuracy - - 1.00 135
macro-avg 1.00 1.00 1.00 135
weighted-avg 1.00 1.00 1.00 135
```
---
layout: model
title: Relation Extraction between different oncological entity types (unspecific version)
author: John Snow Labs
name: re_oncology_wip
date: 2022-09-27
tags: [licensed, clinical, oncology, en, relation_extraction, temporal, test, biomarker, anatomy]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RelationExtractionModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This relation extraction model identifies relations between dates and other clinical entities, between tumor mentions and their size, between anatomical entities and other clinical entities, and between tests and their results. In contrast to re_oncology_granular, all these relation types are labeled as is_related_to. The different types of relations can be identified considering the pairs of entities that are linked.
## Predicted Entities
`is_related_to`, `O`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_oncology_wip_en_4.0.0_3.0_1664302122205.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_oncology_wip_en_4.0.0_3.0_1664302122205.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use relation pairs to include only the combinations of entities that are relevant in your case.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos_tags")
dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \
.setInputCols(["sentence", "pos_tags", "token"]) \
.setOutputCol("dependencies")
re_model = RelationExtractionModel.pretrained("re_oncology_wip", "en", "clinical/models") \
.setInputCols(["embeddings", "pos_tags", "ner_chunk", "dependencies"]) \
.setOutputCol("relation_extraction") \
.setRelationPairs(["Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery"]) \
.setMaxSyntacticDistance(10)
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
pos_tagger,
dependency_parser,
re_model])
data = spark.createDataFrame([["A mastectomy was performed two months ago, and a 3 cm mass was extracted."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos_tags")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentence", "pos_tags", "token"))
.setOutputCol("dependencies")
val re_model = RelationExtractionModel.pretrained("re_oncology_wip", "en", "clinical/models")
.setInputCols(Array("embeddings", "pos_tags", "ner_chunk", "dependencies"))
.setOutputCol("relation_extraction")
.setRelationPairs(Array("Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery"))
.setMaxSyntacticDistance(10)
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
pos_tagger,
dependency_parser,
re_model))
val data = Seq("A mastectomy was performed two months ago, and a 3 cm mass was extracted.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.oncology_wip").predict("""A mastectomy was performed two months ago, and a 3 cm mass was extracted.""")
```
## Results
```bash
chunk1 entity1 chunk2 entity2 relation confidence
mastectomy Cancer_Surgery two months ago Relative_Date is_related_to 0.9623304
3 cm Tumor_Size mass Tumor_Finding is_related_to 0.7947009
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|re_oncology_wip|
|Type:|re|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]|
|Output Labels:|[relations]|
|Language:|en|
|Size:|266.3 KB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
relation recall precision f1
O 0.82 0.88 0.85
is_related_to 0.89 0.83 0.86
macro-avg 0.86 0.86 0.86
```
---
layout: model
title: Fast Neural Machine Translation Model from English to Austro-Asiatic languages
author: John Snow Labs
name: opus_mt_en_aav
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, aav, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `en`
- target languages: `aav`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_aav_xx_2.7.0_2.4_1609169278246.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_aav_xx_2.7.0_2.4_1609169278246.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_aav", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate(["PUT YOUR STRING HERE"])
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_aav", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.aav').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_aav|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from CNT-UPenn)
author: John Snow Labs
name: roberta_qa_for_seizurefrequency
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RoBERTa_for_seizureFrequency_QA` is an English model originally trained by `CNT-UPenn`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_for_seizurefrequency_en_4.3.0_3.0_1674208667059.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_for_seizurefrequency_en_4.3.0_3.0_1674208667059.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_for_seizurefrequency","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_for_seizurefrequency","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_for_seizurefrequency|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|466.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/CNT-UPenn/RoBERTa_for_seizureFrequency_QA
- https://doi.org/10.1093/jamia/ocac018
---
layout: model
title: Pipeline to Detect Living Species(w2v_cc_300d)
author: John Snow Labs
name: ner_living_species_pipeline
date: 2023-03-13
tags: [pt, ner, clinical, licensed]
task: Named Entity Recognition
language: pt
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_living_species](https://nlp.johnsnowlabs.com/2022/06/22/ner_living_species_pt_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_pt_4.3.0_3.2_1678708110628.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_pt_4.3.0_3.2_1678708110628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_living_species_pipeline", "pt", "clinical/models")
text = '''Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_living_species_pipeline", "pt", "clinical/models")
val text = "Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:--------------------|--------:|------:|:------------|-------------:|
| 0 | rapariga | 4 | 11 | HUMAN | 0.9991 |
| 1 | pessoal | 41 | 47 | HUMAN | 0.9765 |
| 2 | paciente | 182 | 189 | HUMAN | 1 |
| 3 | gato | 368 | 371 | SPECIES | 0.9847 |
| 4 | veterinário | 413 | 423 | HUMAN | 0.91 |
| 5 | Trichophyton rubrum | 632 | 650 | SPECIES | 0.9996 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_living_species_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|pt|
|Size:|1.3 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Finnish asr_wav2vec2_large_xlsr_53_finnish_by_vasilis TFWav2Vec2ForCTC from vasilis
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_53_finnish_by_vasilis
date: 2022-09-24
tags: [wav2vec2, fi, audio, open_source, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_finnish_by_vasilis` is a Finnish model originally trained by vasilis.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_finnish_by_vasilis_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_finnish_by_vasilis_fi_4.2.0_3.0_1664024039687.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_finnish_by_vasilis_fi_4.2.0_3.0_1664024039687.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xlsr_53_finnish_by_vasilis", "fi")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xlsr_53_finnish_by_vasilis", "fi")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xlsr_53_finnish_by_vasilis|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|fi|
|Size:|1.2 GB|
---
layout: model
title: English T5ForConditionalGeneration Cased model (from yirmibesogluz)
author: John Snow Labs
name: t5_t2t_ner_ade_balanced
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t2t-ner-ade-balanced` is an English model originally trained by `yirmibesogluz`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_t2t_ner_ade_balanced_en_4.3.0_3.0_1675107775759.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_t2t_ner_ade_balanced_en_4.3.0_3.0_1675107775759.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_t2t_ner_ade_balanced","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_t2t_ner_ade_balanced","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_t2t_ner_ade_balanced|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|924.7 MB|
## References
- https://huggingface.co/yirmibesogluz/t2t-ner-ade-balanced
- https://github.com/gokceuludogan/boun-tabi-smm4h22
---
layout: model
title: Legal Natural Environment Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_natural_environment_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, natural_environment, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
The `legclf_natural_environment_bert` model is a Bert Sentence Embeddings Document Classifier that determines whether a given document belongs to the class `Natural_Environment` or not (binary classification), according to EuroVoc labels.
## Predicted Entities
`Natural_Environment`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_natural_environment_bert_en_1.0.0_3.0_1678111551323.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_natural_environment_bert_en_1.0.0_3.0_1678111551323.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
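A minimal Python sketch, following the pattern used by the other Legal NLP document classifiers on this page. It assumes a licensed Legal NLP installation and a running Spark session (`spark`); the sentence-embeddings stage is an assumption, with `sent_bert_base_cased` standing in for whichever sentence embeddings this classifier was trained with.

```python
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Placeholder embeddings stage: substitute the sentence-embeddings model
# this classifier was actually trained with.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_natural_environment_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```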
## Results
```bash
+---------------------+
|result               |
+---------------------+
|[Natural_Environment]|
|[Other]              |
|[Other]              |
|[Natural_Environment]|
+---------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_natural_environment_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.5 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Natural_Environment 0.89 0.87 0.88 45
Other 0.88 0.90 0.89 50
accuracy - - 0.88 95
macro-avg 0.88 0.88 0.88 95
weighted-avg 0.88 0.88 0.88 95
```
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from jonfrank)
author: John Snow Labs
name: xlmroberta_ner_jonfrank_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `jonfrank`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jonfrank_base_finetuned_panx_de_4.1.0_3.0_1660434803462.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jonfrank_base_finetuned_panx_de_4.1.0_3.0_1660434803462.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jonfrank_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jonfrank_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_jonfrank_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/jonfrank/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: Italian T5ForConditionalGeneration Base Cased model (from aiknowyou)
author: John Snow Labs
name: t5_mt5_base_it_paraphraser
date: 2023-01-30
tags: [it, open_source, t5, tensorflow]
task: Text Generation
language: it
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mt5-base-it-paraphraser` is an Italian model originally trained by `aiknowyou`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_mt5_base_it_paraphraser_it_4.3.0_3.0_1675105866508.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_mt5_base_it_paraphraser_it_4.3.0_3.0_1675105866508.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_mt5_base_it_paraphraser","it") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_mt5_base_it_paraphraser","it")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_mt5_base_it_paraphraser|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|it|
|Size:|969.5 MB|
## References
- https://huggingface.co/aiknowyou/mt5-base-it-paraphraser
- https://arxiv.org/abs/2010.11934
- https://colab.research.google.com/drive/1DGeF190gJ3DjRFQiwhFuZalp427iqJNQ
- https://gist.github.com/avidale/44cd35bfcdaf8bedf51d97c468cc8001
- http://creativecommons.org/licenses/by-nc-sa/4.0/
---
layout: model
title: Detect tumor morphology in Spanish texts
author: John Snow Labs
name: cantemist_scielowiki
date: 2021-07-23
tags: [ner, licensed, oncology, es]
task: Named Entity Recognition
language: es
edition: Spark NLP for Healthcare 3.1.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Detect tumor morphology entities in Spanish text.
## Predicted Entities
`MORFOLOGIA_NEOPLASIA`, `O`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_TUMOR_ES/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/cantemist_scielowiki_es_3.1.2_3.0_1627080305994.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/cantemist_scielowiki_es_3.1.2_3.0_1627080305994.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings_stage = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d", "es", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("word_embeddings")
clinical_ner = MedicalNerModel.pretrained("cantemist_scielowiki", "es", "clinical/models")\
.setInputCols(["sentence", "token", "word_embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence,
tokenizer,
embeddings_stage,
clinical_ner,
ner_converter
])
data = spark.createDataFrame([["""Anamnesis Paciente de 37 años de edad sin antecedentes patológicos ni quirúrgicos de interés. En diciembre de 2012 consultó al Servicio de Urgencias por un cuadro de cefalea aguda e hipostesia del hemicuerpo izquierdo de 15 días de evolución refractario a tratamiento. Exploración neurológica sin focalidad; fondo de ojo: papiledema unilateral. Se solicitaron una TC del SNC, que objetiva una LOE frontal derecha con afectación aparente del cuerpo calloso, y una RM del SNC, que muestra un extenso proceso expansivo intraparenquimatoso frontal derecho que infiltra la rodilla del cuerpo calloso, mal delimitada y sin componente necrótico. Tras la administración de contraste se apreciaban diferentes realces parcheados en la lesión, pero sin definirse una cápsula con aumento del flujo sanguíneo en la lesión, características compatibles con linfoma o astrocitoma anaplásico . El 3 de enero de 2013 se efectúa biopsia intraoperatoria, con diagnóstico histológico de astrocitoma anaplásico GIII"""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_stage = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d", "es", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("word_embeddings")
val clinical_ner = MedicalNerModel.pretrained("cantemist_scielowiki", "es", "clinical/models")
.setInputCols(Array("sentence", "token", "word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence, tokenizer, embeddings_stage, clinical_ner, ner_converter))
val data = Seq("""Anamnesis Paciente de 37 años de edad sin antecedentes patológicos ni quirúrgicos de interés. En diciembre de 2012 consultó al Servicio de Urgencias por un cuadro de cefalea aguda e hipostesia del hemicuerpo izquierdo de 15 días de evolución refractario a tratamiento. Exploración neurológica sin focalidad; fondo de ojo: papiledema unilateral. Se solicitaron una TC del SNC, que objetiva una LOE frontal derecha con afectación aparente del cuerpo calloso, y una RM del SNC, que muestra un extenso proceso expansivo intraparenquimatoso frontal derecho que infiltra la rodilla del cuerpo calloso, mal delimitada y sin componente necrótico. Tras la administración de contraste se apreciaban diferentes realces parcheados en la lesión, pero sin definirse una cápsula con aumento del flujo sanguíneo en la lesión, características compatibles con linfoma o astrocitoma anaplásico . El 3 de enero de 2013 se efectúa biopsia intraoperatoria, con diagnóstico histológico de astrocitoma anaplásico GIII""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+---------------------+----------------------+
| token | prediction |
+---------------------+----------------------+
| Anamnesis | O |
| Paciente | O |
| de | O |
| 37 | O |
| años | O |
| de | O |
| edad | O |
| sin | O |
| antecedentes | O |
| patológicos | O |
| ni | O |
| quirúrgicos | O |
| de | O |
| interés | O |
| . | O |
| En | O |
| diciembre | O |
| de | O |
| 2012 | O |
| consultó | O |
| al | O |
| Servicio | O |
| de | O |
| Urgencias | O |
| por | O |
| un | O |
| cuadro | O |
| de | O |
| cefalea | O |
| aguda | O |
| e | O |
| hipostesia | O |
| del | O |
| hemicuerpo | O |
| izquierdo | O |
| de | O |
| 15 | O |
| días | O |
| de | O |
| evolución | O |
| refractario | O |
| a | O |
| tratamiento | O |
| . | O |
| Exploración | O |
| neurológica | O |
| sin | O |
| focalidad | O |
| ; | O |
| fondo | O |
| de | O |
| ojo | O |
| : | O |
| papiledema | O |
| unilateral | O |
| . | O |
| Se | O |
| solicitaron | O |
| una | O |
| TC | O |
| del | O |
| SNC | B-MORFOLOGIA_NEOP... |
| , | O |
| que | O |
| objetiva | O |
| una | O |
| LOE | O |
| frontal | O |
| derecha | O |
| con | O |
| afectación | B-MORFOLOGIA_NEOP... |
| aparente | I-MORFOLOGIA_NEOP... |
| del | I-MORFOLOGIA_NEOP... |
| cuerpo | I-MORFOLOGIA_NEOP... |
| calloso | I-MORFOLOGIA_NEOP... |
| , | O |
| y | O |
| una | O |
| RM | B-MORFOLOGIA_NEOP... |
| del | I-MORFOLOGIA_NEOP... |
| SNC | I-MORFOLOGIA_NEOP... |
| , | O |
| que | O |
| muestra | O |
| un | O |
| extenso | O |
| proceso | B-MORFOLOGIA_NEOP... |
| expansivo | I-MORFOLOGIA_NEOP... |
| intraparenquimatoso | I-MORFOLOGIA_NEOP... |
| frontal | I-MORFOLOGIA_NEOP... |
| derecho | I-MORFOLOGIA_NEOP... |
| que | I-MORFOLOGIA_NEOP... |
| infiltra | I-MORFOLOGIA_NEOP... |
| la | I-MORFOLOGIA_NEOP... |
| rodilla | I-MORFOLOGIA_NEOP... |
| del | I-MORFOLOGIA_NEOP... |
| cuerpo | I-MORFOLOGIA_NEOP... |
| calloso | I-MORFOLOGIA_NEOP... |
| , | O |
| mal | O |
| delimitada | O |
| y | O |
| sin | O |
| componente | O |
| necrótico | O |
| . | O |
| Tras | O |
| la | O |
| administración | O |
| de | O |
| contraste | O |
| se | O |
| apreciaban | O |
| diferentes | O |
| realces | O |
| parcheados | O |
| en | O |
| la | O |
| lesión | O |
| , | O |
| pero | O |
| sin | O |
| definirse | O |
| una | O |
| cápsula | O |
| con | O |
| aumento | O |
| del | O |
| flujo | O |
| sanguíneo | O |
| en | O |
| la | O |
| lesión | O |
| , | O |
| características | O |
| compatibles | O |
| con | O |
| linfoma | O |
| o | O |
| astrocitoma | B-MORFOLOGIA_NEOP... |
| anaplásico | I-MORFOLOGIA_NEOP... |
| . | O |
| El | O |
| 3 | O |
| de | O |
| enero | O |
| de | O |
| 2013 | O |
| se | O |
| efectúa | O |
| biopsia | O |
| intraoperatoria | O |
| , | O |
| con | O |
| diagnóstico | O |
| histológico | O |
| de | O |
| astrocitoma | B-MORFOLOGIA_NEOP... |
| anaplásico | I-MORFOLOGIA_NEOP... |
| GIII | I-MORFOLOGIA_NEOP... |
+---------------------+----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|cantemist_scielowiki|
|Compatibility:|Spark NLP for Healthcare 3.1.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, word_embeddings]|
|Output Labels:|[ner]|
|Language:|es|
|Dependencies:|embeddings_scielowiki_300d|
## Data Source
The model was trained with the [CANTEMIST](https://temu.bsc.es/cantemist/) data set:
> CANTEMIST is an annotated data set for oncology analysis in the Spanish language, containing 1,301 oncological clinical case reports with a total of 63,016 sentences and 1,093,501 tokens. All documents of the corpus have been manually annotated by clinical experts with mentions of tumor morphology (in Spanish, "morfología de neoplasia"). There are 16,030 tumor morphology mentions mapped to an eCIE-O code (850 unique codes).
References:
1. P. Ruas, A. Neves, V. D. Andrade, F. M. Couto, Lasigebiotm at cantemist: Named entity recognition and normalization of tumour morphology entities and clinical coding of Spanish health-related documents, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020
2. Antonio Miranda-Escalada, Eulàlia Farré-Maduell, Martin Krallinger. Named Entity Recognition, Concept Normalization and Clinical Coding: Overview of the Cantemist Track for Cancer Text Mining in Spanish, Corpus, Guidelines, Methods and Results. Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings. 303-323 (2020).
## Benchmarking
```bash
label precision recall f1-score support
B-MORFOLOGIA_NEOPLASIA 0.94 0.73 0.83 2474
I-MORFOLOGIA_NEOPLASIA 0.81 0.74 0.77 3169
O 0.99 1.00 1.00 283006
accuracy - - 0.99 288649
macro-avg 0.92 0.82 0.87 288649
weighted-avg 0.99 0.99 0.99 288649
```
---
layout: model
title: Yue Chinese asr_wav2vec2_large_xlsr_cantonese_by_ctl TFWav2Vec2ForCTC from ctl
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_cantonese_by_ctl
date: 2022-09-24
tags: [wav2vec2, yue, audio, open_source, asr]
task: Automatic Speech Recognition
language: yue
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_cantonese_by_ctl` is a Yue Chinese model originally trained by ctl.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_cantonese_by_ctl_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_cantonese_by_ctl_yue_4.2.0_3.0_1664039699316.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_cantonese_by_ctl_yue_4.2.0_3.0_1664039699316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xlsr_cantonese_by_ctl", "yue")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xlsr_cantonese_by_ctl", "yue")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xlsr_cantonese_by_ctl|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|yue|
|Size:|1.2 GB|
---
layout: model
title: Google's Tapas Table Understanding (Small, SQA)
author: John Snow Labs
name: table_qa_tapas_small_finetuned_sqa
date: 2022-09-30
tags: [en, table, qa, question, answering, open_source]
task: Table Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: TapasForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a Zero-shot Table Understanding Model which allows you to carry out Question Answering on Spark DataFrames. If your table is stored in a file format such as CSV, load it into Spark before using the model.
Size of this model: Small
Has aggregation operations?: False
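The pipeline in this page expects the table serialized as JSON with `header` and `rows` keys. As a minimal sketch (the helper name and toy data here are illustrative, not part of Spark NLP), a CSV file can be converted to that layout with the Python standard library:

```python
import csv
import io
import json

def csv_to_table_json(csv_text: str) -> str:
    """Convert CSV text into the {"header": [...], "rows": [...]} JSON
    layout consumed by TableAssembler."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return json.dumps({"header": rows[0], "rows": rows[1:]})

# Hypothetical toy table; fields containing commas must be quoted.
csv_text = 'name,money,age\n"Donald Trump","$100,000,000",75\n"Elon Musk","$20,000,000,000,000",55\n'
table_json = csv_to_table_json(csv_text)
```

The resulting string can be placed in the `table_json` column shown in the usage example below.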
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_small_finetuned_sqa_en_4.2.0_3.0_1664530724535.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_small_finetuned_sqa_en_4.2.0_3.0_1664530724535.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
json_data = """
{
"header": ["name", "money", "age"],
"rows": [
["Donald Trump", "$100,000,000", "75"],
["Elon Musk", "$20,000,000,000,000", "55"]
]
}
"""
queries = [
"Who earns less than 200,000,000?",
"Who earns 100,000,000?",
"How much money has Donald Trump?",
"How old are they?",
]
data = spark.createDataFrame([
[json_data, " ".join(queries)]
]).toDF("table_json", "questions")
document_assembler = MultiDocumentAssembler() \
.setInputCols("table_json", "questions") \
.setOutputCols("document_table", "document_questions")
sentence_detector = SentenceDetector() \
.setInputCols(["document_questions"]) \
.setOutputCol("questions")
table_assembler = TableAssembler()\
.setInputCols(["document_table"])\
.setOutputCol("table")
tapas = TapasForQuestionAnswering\
.pretrained("table_qa_tapas_small_finetuned_sqa","en")\
.setInputCols(["questions", "table"])\
.setOutputCol("answers")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
table_assembler,
tapas
])
model = pipeline.fit(data)
model\
.transform(data)\
.selectExpr("explode(answers) AS answer")\
.select("answer")\
.show(truncate=False)
```
## Results
```bash
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|answer |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} |
|{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} |
|{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} |
|{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|table_qa_tapas_small_finetuned_sqa|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|110.1 MB|
|Case sensitive:|false|
## References
https://www.microsoft.com/en-us/download/details.aspx?id=54253
---
layout: model
title: Fast Neural Machine Translation Model from Manx to English
author: John Snow Labs
name: opus_mt_gv_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, gv, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `gv`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_gv_en_xx_2.7.0_2.4_1609165146913.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_gv_en_xx_2.7.0_2.4_1609165146913.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_gv_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_gv_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.gv.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_gv_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_16_finetuned_squad_seed_2
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_2_en_4.3.0_3.0_1674214328836.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_2_en_4.3.0_3.0_1674214328836.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_16_finetuned_squad_seed_2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|416.2 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-2
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from rowan1224)
author: John Snow Labs
name: distilbert_qa_squad_slp
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-squad-slp` is an English model originally trained by `rowan1224`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_slp_en_4.3.0_3.0_1672774348522.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_slp_en_4.3.0_3.0_1672774348522.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_slp","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_slp","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_squad_slp|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/rowan1224/distilbert-squad-slp
---
layout: model
title: Legal Subordination Clause Binary Classifier
author: John Snow Labs
name: legclf_subordination_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `subordination` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a True/False value for each legal clause model you add.
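As a rough illustration of the splitting advice above (a sketch only, not part of the Legal NLP library; the whitespace word count is just a proxy for the model's subword tokenizer), paragraph splitting by multiline combined with a 512-token budget can look like this:

```python
import re

MAX_TOKENS = 512  # the model's sentence embeddings accept up to 512 tokens

def split_paragraphs(text: str, max_tokens: int = MAX_TOKENS) -> list:
    """Split a long legal document into paragraph-sized chunks
    (paragraph splitting "by multiline"), keeping each chunk under
    a rough whitespace-token budget."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    for para in paragraphs:
        words = para.split()
        # a paragraph longer than the budget is cut into word windows
        for start in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks
```

Each resulting chunk can then be fed to the classifier as an independent document.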
## Predicted Entities
`other`, `subordination`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_subordination_clause_en_1.0.0_3.2_1660124022192.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_subordination_clause_en_1.0.0_3.2_1660124022192.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+---------------+
|         result|
+---------------+
|[subordination]|
|        [other]|
|        [other]|
|[subordination]|
+---------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_subordination_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.94 0.98 0.96 49
subordination 0.96 0.89 0.93 28
accuracy - - 0.95 77
macro-avg 0.95 0.94 0.94 77
weighted-avg 0.95 0.95 0.95 77
```
---
layout: model
title: Detect Living Species (bert_embeddings_bert_base_italian_xxl_cased)
author: John Snow Labs
name: ner_living_species_bert
date: 2022-06-23
tags: [it, ner, clinical, licensed, bert]
task: Named Entity Recognition
language: it
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Extract living species from clinical texts in Italian, a task critical to scientific disciplines such as medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `bert_embeddings_bert_base_italian_xxl_cased` embeddings.
It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others.
**NOTE :**
1. The text files were translated from Spanish with a neural machine translation system.
2. The annotations were translated with the same neural machine translation system.
3. The translated annotations were transferred to the translated text files using an annotation transfer technology.
## Predicted Entities
`HUMAN`, `SPECIES`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_it_3.5.3_3.0_1655972219820.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_it_3.5.3_3.0_1655972219820.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_italian_xxl_cased", "it")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_living_species_bert", "it", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter
])
data = spark.createDataFrame([["""Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_italian_xxl_cased", "it")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_living_species_bert", "it", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter))
val data = Seq("""Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("it.med_ner.living_species.bert").predict("""Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato.""")
```
## Results
```bash
+----------------+-------+
|ner_chunk |label |
+----------------+-------+
|donna |HUMAN |
|personale |HUMAN |
|madre |HUMAN |
|fratello |HUMAN |
|sorella |HUMAN |
|virus epatotropi|SPECIES|
|HBV |SPECIES|
|HCV |SPECIES|
|HIV |SPECIES|
|paziente |HUMAN |
+----------------+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_living_species_bert|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|it|
|Size:|16.4 MB|
## References
[https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/)
## Benchmarking
```bash
label precision recall f1-score support
B-HUMAN 0.88 0.95 0.91 2772
B-SPECIES 0.76 0.89 0.82 2860
I-HUMAN 0.70 0.59 0.64 101
I-SPECIES 0.70 0.81 0.75 1036
micro-avg 0.80 0.90 0.85 6769
macro-avg 0.76 0.81 0.78 6769
weighted-avg 0.80 0.90 0.85 6769
```
---
layout: model
title: English asr_wav2vec_finetuned_on_cryptocurrency TFWav2Vec2ForCTC from distractedm1nd
author: John Snow Labs
name: asr_wav2vec_finetuned_on_cryptocurrency
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec_finetuned_on_cryptocurrency` is an English model originally trained by distractedm1nd.
NOTE: This model works only on a CPU. If you need to run it on a GPU device, please use asr_wav2vec_finetuned_on_cryptocurrency_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec_finetuned_on_cryptocurrency_en_4.2.0_3.0_1664024959023.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec_finetuned_on_cryptocurrency_en_4.2.0_3.0_1664024959023.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec_finetuned_on_cryptocurrency", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec_finetuned_on_cryptocurrency", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec_finetuned_on_cryptocurrency|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: RE Pipeline between Tests, Results, and Dates
author: John Snow Labs
name: re_test_result_date_pipeline
date: 2023-06-13
tags: [licensed, clinical, relation_extraction, tests, results, dates, en]
task: Relation Extraction
language: en
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [re_test_result_date](https://nlp.johnsnowlabs.com/2021/02/24/re_test_result_date_en.html) model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_test_result_date_pipeline_en_4.4.4_3.2_1686665254277.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_test_result_date_pipeline_en_4.4.4_3.2_1686665254277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("re_test_result_date_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("re_test_result_date_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.date_test_result.pipeline").predict("""He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%""")
```
## Results
```bash
| index | relations    | entity1 | chunk1      | entity2      | chunk2 |
|-------|--------------|---------|-------------|--------------|--------|
| 0     | O            | TEST    | chest X-ray | MEASUREMENTS | 93%    |
| 1     | O            | TEST    | CT scan     | MEASUREMENTS | 93%    |
| 2     | is_result_of | TEST    | SpO2        | MEASUREMENTS | 93%    |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|re_test_result_date_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- PerceptronModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- DependencyParserModel
- RelationExtractionModel
---
layout: model
title: English BertForQuestionAnswering model (from aymanm419)
author: John Snow Labs
name: bert_qa_araSpeedest
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `araSpeedest` is an English model originally trained by `aymanm419`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_araSpeedest_en_4.0.0_3.0_1654179104397.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_araSpeedest_en_4.0.0_3.0_1654179104397.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_araSpeedest","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_araSpeedest","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bert.by_aymanm419").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
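Behind the `answer` output, extractive QA models score every token of the context as a candidate answer start and end, then return the best-scoring span. A toy, pure-Python sketch of that span selection (illustrative only, not Spark NLP's internals; the scores below are made up):

```python
# Toy illustration of extractive-QA span selection: pick the (start, end)
# token pair with the highest combined score, with end >= start.
# Not the actual Spark NLP/BERT implementation, just the core idea.
def best_span(start_scores, end_scores, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end   = [0.1, 0.1, 0.1, 4.8, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # → Clara
```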
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_araSpeedest|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|505.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/aymanm419/araSpeedest
---
layout: model
title: Legal No material adverse change Clause Binary Classifier
author: John Snow Labs
name: legclf_no_material_adverse_change_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `no-material-adverse-change` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Keep in mind that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clauses Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you have added.
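As a rough illustration of the paragraph-splitting idea above (a minimal sketch, not part of Spark NLP; token counts are approximated here by whitespace splitting, which is only a proxy for the embedding model's 512-token limit):

```python
# Minimal sketch: split a document into blank-line-separated paragraphs,
# then pack consecutive paragraphs into pieces under a token budget.
# Whitespace tokenization is an approximation of the real tokenizer.
import re

MAX_TOKENS = 512

def split_paragraphs(text, max_tokens=MAX_TOKENS):
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    pieces, current, count = [], [], 0
    for p in paragraphs:
        n = len(p.split())
        if current and count + n > max_tokens:
            pieces.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += n
    if current:
        pieces.append("\n\n".join(current))
    return pieces

doc = "Clause 1. No MAE has occurred.\n\nClause 2. Governing law is New York."
print(split_paragraphs(doc))
```

Each resulting piece can then be sent through the classifier pipeline as its own row.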
## Predicted Entities
`other`, `no-material-adverse-change`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_material_adverse_change_clause_en_1.0.0_3.2_1660122691534.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_material_adverse_change_clause_en_1.0.0_3.2_1660122691534.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+----------------------------+
|result                      |
+----------------------------+
|[no-material-adverse-change]|
|[other]                     |
|[other]                     |
|[no-material-adverse-change]|
+----------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_no_material_adverse_change_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
no-material-adverse-change 0.95 0.97 0.96 37
other 0.99 0.98 0.99 103
accuracy - - 0.98 140
macro-avg 0.97 0.98 0.97 140
weighted-avg 0.98 0.98 0.98 140
```
---
layout: model
title: Detect Anatomical Regions (MedicalBertForTokenClassifier)
author: John Snow Labs
name: bert_token_classifier_ner_anatomy
date: 2022-01-06
tags: [anatomy, bertfortokenclassification, ner, en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.3.4
spark_version: 2.4
supported: true
annotator: MedicalBertForTokenClassifier
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for anatomy terms. This model is trained with the BertForTokenClassification method from the transformers library and imported into Spark NLP.
## Predicted Entities
`Anatomical_system`, `Cell`, `Cellular_component`, `Developing_anatomical_structure`, `Immaterial_anatomical_entity`, `Multi-tissue_structure`, `Organ`, `Organism_subdivision`, `Organism_substance`, `Pathological_formation`, `Tissue`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatomy_en_3.3.4_2.4_1641454747169.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatomy_en_3.3.4_2.4_1641454747169.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_anatomy", "en", "clinical/models")\
.setInputCols("token", "document")\
.setOutputCol("ner")\
.setCaseSensitive(True)
ner_converter = NerConverter() \
.setInputCols(["document","token","ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter])
data = spark.createDataFrame([["""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_anatomy", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("ner")
.setCaseSensitive(true)
val ner_converter = new NerConverter()
.setInputCols(Array("document","token","ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter))
val data = Seq("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.ner_anatomy").predict("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""")
```
## Results
```bash
+-------------------+----------------------+
|chunk |ner_label |
+-------------------+----------------------+
|great toe |Multi-tissue_structure|
|skin |Organ |
|conjunctivae |Multi-tissue_structure|
|Extraocular muscles|Multi-tissue_structure|
|Nares |Multi-tissue_structure|
|turbinates |Multi-tissue_structure|
|Oropharynx |Multi-tissue_structure|
|Mucous membranes |Tissue |
|Neck |Organism_subdivision |
|bowel |Organ |
|great toe |Multi-tissue_structure|
|skin |Organ |
|toenails |Organism_subdivision |
|foot |Organism_subdivision |
|great toe |Multi-tissue_structure|
|toenails |Organism_subdivision |
+-------------------+----------------------+
```
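The `NerConverter` stage assembles the chunks above by merging consecutive B-/I- token tags into labeled spans; a minimal pure-Python sketch of that merging logic (illustrative only, not the actual Spark NLP implementation):

```python
# Illustrative sketch of merging B-/I- token tags into (chunk, label)
# pairs, approximating what NerConverter does with the classifier output.
def merge_bio(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(merge_bio(
    ["Extraocular", "muscles", "intact", "."],
    ["B-Multi-tissue_structure", "I-Multi-tissue_structure", "O", "O"]))
# → [('Extraocular muscles', 'Multi-tissue_structure')]
```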
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_anatomy|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|404.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## Data Source
Trained on the Anatomical Entity Mention (AnEM) corpus with 'embeddings_clinical'. http://www.nactem.ac.uk/anatomy/
## Benchmarking
```bash
label precision recall f1-score support
B-Anatomical_system 1.00 0.50 0.67 4
B-Cell 0.89 0.96 0.92 74
B-Cellular_component 0.97 0.81 0.88 36
B-Developing_anatomical_structure 1.00 0.50 0.67 6
B-Immaterial_anatomical_entity 0.60 0.75 0.67 4
B-Multi-tissue_structure 0.75 0.86 0.80 58
B-Organ 0.86 0.88 0.87 48
B-Organism_subdivision 0.62 0.42 0.50 12
B-Organism_substance 0.89 0.81 0.85 31
B-Pathological_formation 0.91 0.91 0.91 32
B-Tissue 0.94 0.76 0.84 21
I-Anatomical_system 1.00 1.00 1.00 1
I-Cell 1.00 0.84 0.91 62
I-Cellular_component 0.92 0.85 0.88 13
I-Developing_anatomical_structure 1.00 1.00 1.00 1
I-Immaterial_anatomical_entity 1.00 1.00 1.00 1
I-Multi-tissue_structure 1.00 0.77 0.87 26
I-Organ 0.80 0.80 0.80 5
I-Organism_substance 1.00 0.71 0.83 7
I-Pathological_formation 1.00 0.94 0.97 16
I-Tissue 0.93 0.93 0.93 15
accuracy - - 0.84 473
macro-avg 0.87 0.77 0.83 473
weighted-avg 0.90 0.84 0.86 473
```
---
layout: model
title: Part of Speech for Latin
author: John Snow Labs
name: pos_ud_llct
date: 2020-07-29 23:34:00 +0800
task: Part of Speech Tagging
language: la
edition: Spark NLP 2.5.5
spark_version: 2.4
tags: [pos, la]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_llct_la_2.5.5_2.4_1596054191115.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_llct_la_2.5.5_2.4_1596054191115.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
pos = PerceptronModel.pretrained("pos_ud_llct", "la") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Alius est esse regem Aquilonis, et de Anglis medicus et nives Ioannes dux in progressus medicinae anesthesia et hygiene.")
```
```scala
...
val pos = PerceptronModel.pretrained("pos_ud_llct", "la")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("Alius est esse regem Aquilonis, et de Anglis medicus et nives Ioannes dux in progressus medicinae anesthesia et hygiene.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Alius est esse regem Aquilonis, et de Anglis medicus et nives Ioannes dux in progressus medicinae anesthesia et hygiene."""]
pos_df = nlu.load('la.pos').predict(text, output_level='token')
pos_df
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='pos', begin=0, end=4, result='PROPN', metadata={'word': 'Alius'}),
Row(annotatorType='pos', begin=6, end=8, result='AUX', metadata={'word': 'est'}),
Row(annotatorType='pos', begin=10, end=13, result='VERB', metadata={'word': 'esse'}),
Row(annotatorType='pos', begin=15, end=19, result='VERB', metadata={'word': 'regem'}),
Row(annotatorType='pos', begin=21, end=29, result='PROPN', metadata={'word': 'Aquilonis'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_llct|
|Type:|pos|
|Compatibility:|Spark NLP 2.5.5+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[pos]|
|Language:|la|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: Japanese T5ForConditionalGeneration Cased model (from astremo)
author: John Snow Labs
name: t5_friendly
date: 2023-01-30
tags: [ja, open_source, t5, tensorflow]
task: Text Generation
language: ja
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `friendly_JA` is a Japanese model originally trained by `astremo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_friendly_ja_4.3.0_3.0_1675102435483.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_friendly_ja_4.3.0_3.0_1675102435483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_friendly","ja") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_friendly","ja")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_friendly|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|ja|
|Size:|923.1 MB|
## References
- https://huggingface.co/astremo/friendly_JA
- http://creativecommons.org/licenses/by/4.0/
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from skandaonsolve)
author: John Snow Labs
name: roberta_qa_finetuned_timeentities2_ttsp75
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-finetuned-timeentities2_ttsp75` is an English model originally trained by `skandaonsolve`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_timeentities2_ttsp75_en_4.3.0_3.0_1674220728523.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_timeentities2_ttsp75_en_4.3.0_3.0_1674220728523.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_timeentities2_ttsp75","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_timeentities2_ttsp75","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_finetuned_timeentities2_ttsp75|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|464.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/skandaonsolve/roberta-finetuned-timeentities2_ttsp75
---
layout: model
title: Fast Neural Machine Translation Model from Hausa to English
author: John Snow Labs
name: opus_mt_ha_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, ha, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `ha`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ha_en_xx_2.7.0_2.4_1609168807678.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ha_en_xx_2.7.0_2.4_1609168807678.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_ha_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_ha_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.ha.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_ha_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Financial Zero-shot NER
author: John Snow Labs
name: finner_roberta_zeroshot
date: 2022-09-02
tags: [en, finance, ner, zero, shot, zeroshot, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
recommended: true
annotator: ZeroShotNER
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained to carry out a Zero-Shot Named Entity Recognition (NER) approach, detecting any kind of entity with no training dataset, just the pretrained RoBERTa embeddings (included in the model) and a few examples.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_ZEROSHOT/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_roberta_zeroshot_en_1.0.0_3.2_1662113599526.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_roberta_zeroshot_en_1.0.0_3.2_1662113599526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sparktokenizer = nlp.Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
zero_shot_ner = finance.ZeroShotNerModel.pretrained("finner_roberta_zeroshot", "en", "finance/models")\
.setInputCols(["document", "token"])\
.setOutputCol("zero_shot_ner")\
.setEntityDefinitions(
{
"DATE": ['When was the company acquisition?', 'When was the company purchase agreement?'],
"ORG": ["Which company was acquired?"],
"PRODUCT": ["Which product?"],
"PROFIT_INCREASE": ["How much has the gross profit increased?"],
"REVENUES_DECLINED": ["How much has the revenues declined?"],
"OPERATING_LOSS_2020": ["Which was the operating loss in 2020"],
"OPERATING_LOSS_2019": ["Which was the operating loss in 2019"]
})
nerconverter = nlp.NerConverter()\
.setInputCols(["document", "token", "zero_shot_ner"])\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sparktokenizer,
zero_shot_ner,
nerconverter,
]
)
sample_text = ["In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.",
"In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.",
"While our gross profit margin increased to 81.4% in 2020 from 63.1% in 2019, our revenues declined approximately 27% in 2020 as compared to 2019.",
"We reported an operating loss of approximately $8,048,581 million in 2020 as compared to an operating loss of approximately $7,738,193 million in 2019."]
p_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
res = p_model.transform(spark.createDataFrame(sample_text, StringType()).toDF("text"))
res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols")) \
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['3']['entity']").alias("ner_label"))\
.filter("ner_label!='O'")\
.show(truncate=False)
```
## Results
```bash
+------------------+-------------------+
|chunk |ner_label |
+------------------+-------------------+
|March 2012 |DATE |
|Vertro |ORG |
|ALOT |PRODUCT |
|February 2017 |DATE |
|NetSeer |ORG |
|81.4% |PROFIT_INCREASE |
|27% |REVENUES_DECLINED |
|$8,048,581 million|OPERATING_LOSS_2020|
|$7,738,193 million|OPERATING_LOSS_2019|
+------------------+-------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finner_roberta_zeroshot|
|Type:|finance|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|460.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
Financial Roberta Embeddings
---
layout: model
title: Malay T5ForConditionalGeneration Base Cased model (from mesolitica)
author: John Snow Labs
name: t5_base_bahasa_cased
date: 2023-01-30
tags: [ms, open_source, t5, tensorflow]
task: Text Generation
language: ms
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-bahasa-cased` is a Malay model originally trained by `mesolitica`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_bahasa_cased_ms_4.3.0_3.0_1675108290125.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_bahasa_cased_ms_4.3.0_3.0_1675108290125.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_base_bahasa_cased","ms") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_base_bahasa_cased","ms")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_base_bahasa_cased|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|ms|
|Size:|473.3 MB|
## References
- https://huggingface.co/mesolitica/t5-base-bahasa-cased
- https://github.com/huseinzol05/malaya/tree/master/pretrained-model/t5/prepare
- https://github.com/google-research/text-to-text-transfer-transformer
- https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from LucasS)
author: John Snow Labs
name: roberta_qa_robertabaseabsa
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `robertaBaseABSA` is an English model originally trained by `LucasS`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_robertabaseabsa_en_4.3.0_3.0_1674222849343.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_robertabaseabsa_en_4.3.0_3.0_1674222849343.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_robertabaseabsa","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_robertabaseabsa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_robertabaseabsa|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|437.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/LucasS/robertaBaseABSA
---
layout: model
title: Arabic BertForMaskedLM Base Cased model (from CAMeL-Lab)
author: John Snow Labs
name: bert_embeddings_base_arabic_camel_msa
date: 2022-12-02
tags: [ar, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: ar
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-arabic-camelbert-msa` is an Arabic model originally trained by `CAMeL-Lab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_msa_ar_4.2.4_3.0_1670016029582.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_msa_ar_4.2.4_3.0_1670016029582.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_msa","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_msa","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_arabic_camel_msa|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|409.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa
- https://arxiv.org/abs/2103.06678
- https://github.com/CAMeL-Lab/CAMeLBERT
- https://catalog.ldc.upenn.edu/LDC2011T11
- http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus
- https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian
- https://archive.org/details/arwiki-20190201
- https://oscar-corpus.com/
- https://github.com/google-research/bert
- https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297
- https://github.com/CAMeL-Lab/camel_tools
- https://github.com/CAMeL-Lab/CAMeLBERT
---
layout: model
title: French RoBERTa Embeddings (from benjamin)
author: John Snow Labs
name: roberta_embeddings_roberta_base_wechsel_french
date: 2022-04-14
tags: [roberta, embeddings, fr, open_source]
task: Embeddings
language: fr
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-base-wechsel-french` is a French model originally trained by `benjamin`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_base_wechsel_french_fr_3.4.2_3.0_1649947929675.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_base_wechsel_french_fr_3.4.2_3.0_1649947929675.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_wechsel_french","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark Nlp"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_wechsel_french","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark Nlp").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("fr.embed.roberta_base_wechsel_french").predict("""J'adore Spark Nlp""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_roberta_base_wechsel_french|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|fr|
|Size:|468.6 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/benjamin/roberta-base-wechsel-french
- https://github.com/CPJKU/wechsel
- https://arxiv.org/abs/2112.06598
---
layout: model
title: Stop Words Cleaner for Basque
author: John Snow Labs
name: stopwords_eu
date: 2020-07-14 19:03:00 +0800
task: Stop Words Removal
language: eu
edition: Spark NLP 2.5.4
spark_version: 2.4
tags: [stopwords, eu]
supported: true
annotator: StopWordsCleaner
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
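The filtering itself is simple token-wise membership in a stop-word set; a minimal language-agnostic sketch (the Basque stop words below are a tiny illustrative subset, not the model's actual list):

```python
# Toy stop-word removal: drop tokens whose lowercase form is in the stop list.
# The stop list below is a small illustrative subset, NOT the model's real list.
BASQUE_STOPWORDS = {"eta", "da", "duen", "gain"}

def clean_tokens(tokens, stopwords=BASQUE_STOPWORDS):
    """Keep only tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["Iparraldeko", "erregea", "izateaz", "gain", ",", "mediku", "ingelesa", "eta"]
print(clean_tokens(tokens))
```

The pretrained annotator works the same way, only with a curated Basque stop-word list and Spark NLP's token annotations.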
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_eu_eu_2.5.4_2.4_1594742441951.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_eu_eu_2.5.4_2.4_1594742441951.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
stop_words = StopWordsCleaner.pretrained("stopwords_eu", "eu") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Iparraldeko erregea izateaz gain, mediku ingelesa eta anestesia eta higiene medikoa garatzen duen liderra da John Snow.")
```
```scala
...
val stopWords = StopWordsCleaner.pretrained("stopwords_eu", "eu")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords))
val data = Seq("Iparraldeko erregea izateaz gain, mediku ingelesa eta anestesia eta higiene medikoa garatzen duen liderra da John Snow.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Iparraldeko erregea izateaz gain, mediku ingelesa eta anestesia eta higiene medikoa garatzen duen liderra da John Snow."""]
stopword_df = nlu.load('eu.stopwords').predict(text)
stopword_df[["cleanTokens"]]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=10, result='Iparraldeko', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=12, end=18, result='erregea', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=20, end=26, result='izateaz', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=28, end=31, result='gain', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=32, end=32, result=',', metadata={'sentence': '0'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_eu|
|Type:|stopwords|
|Compatibility:|Spark NLP 2.5.4+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|eu|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)
---
layout: model
title: ICDO Entity Resolver
author: John Snow Labs
name: chunkresolve_icdo_clinical
class: ChunkEntityResolverModel
language: en
nav_key: models
repository: clinical/models
date: 2020-04-21
task: Entity Resolution
edition: Healthcare NLP 2.4.2
spark_version: 2.4
tags: [clinical,licensed,entity_resolution,en]
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Entity resolution model based on k-nearest-neighbor search over word embeddings, using Word Mover's Distance.
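The resolution strategy can be sketched as a 1-nearest-neighbor search over code descriptions under a relaxed Word Mover's Distance. The 2-d vectors and the tiny code table below are made up for illustration only; the real model uses clinical word embeddings and the full ICD-O vocabulary:

```python
import math

# Hypothetical 2-d word vectors for illustration; the real model uses
# high-dimensional clinical embeddings.
VEC = {
    "breast": (1.0, 0.1), "carcinoma": (0.9, 0.8), "adenocarcinoma": (0.95, 0.75),
    "sarcoma": (0.1, 0.9), "kaposi": (0.0, 1.0),
}

def dist(a, b):
    return math.dist(VEC[a], VEC[b])

def relaxed_wmd(doc_a, doc_b):
    # Relaxed WMD: each word in doc_a "travels" to its closest word in doc_b.
    return sum(min(dist(w, v) for v in doc_b) for w in doc_a) / len(doc_a)

def resolve(chunk_tokens, code_table):
    # 1-NN over code descriptions by relaxed WMD.
    return min(code_table, key=lambda code: relaxed_wmd(chunk_tokens, code_table[code]))

codes = {
    "8500/2": ["breast", "carcinoma"],  # toy stand-in for a carcinoma description
    "9140/3": ["kaposi", "sarcoma"],    # toy stand-in for Kaposi sarcoma
}
print(resolve(["breast", "adenocarcinoma"], codes))  # resolves to "8500/2"
```

The production model replaces the toy distance with a trained KNN index over chunk embeddings, but the nearest-description intuition is the same.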
## Predicted Entities
ICD-O Codes and their normalized definition with `clinical_embeddings`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICDO/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICDO.ipynb#scrollTo=Qdh2BQaLI7tU){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icdo_clinical_en_2.4.5_2.4_1587491354644.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icdo_clinical_en_2.4.5_2.4_1587491354644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPython.html %}
```python
...
model = ChunkEntityResolverModel.pretrained("chunkresolve_icdo_clinical","en","clinical/models")
.setInputCols("token","chunk_embeddings")
.setOutputCol("entity")
pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, clinical_ner_model, clinical_ner_chunker, chunk_embeddings, model])
data = ["""DIAGNOSIS: Left breast adenocarcinoma stage T3 N1b M0, stage IIIA.
She has been found more recently to have stage IV disease with metastatic deposits and recurrence involving the chest wall and lower left neck lymph nodes.
PHYSICAL EXAMINATION
NECK: On physical examination palpable lymphadenopathy is present in the left lower neck and supraclavicular area. No other cervical lymphadenopathy or supraclavicular lymphadenopathy is present.
RESPIRATORY: Good air entry bilaterally. Examination of the chest wall reveals a small lesion where the chest wall recurrence was resected. No lumps, bumps or evidence of disease involving the right breast is present.
ABDOMEN: Normal bowel sounds, no hepatomegaly. No tenderness on deep palpation. She has just started her last cycle of chemotherapy today, and she wishes to visit her daughter in Brooklyn, New York. After this she will return in approximately 3 to 4 weeks and begin her radiotherapy treatment at that time."""]
pipeline_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_pipeline = LightPipeline(pipeline_model)
result = light_pipeline.annotate(data)
```
```scala
...
val model = ChunkEntityResolverModel.pretrained("chunkresolve_icdo_clinical","en","clinical/models")
.setInputCols("token","chunk_embeddings")
.setOutputCol("entity")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, clinical_ner_model, clinical_ner_chunker, chunk_embeddings, model))
val data = Seq("DIAGNOSIS: Left breast adenocarcinoma stage T3 N1b M0, stage IIIA. She has been found more recently to have stage IV disease with metastatic deposits and recurrence involving the chest wall and lower left neck lymph nodes. PHYSICAL EXAMINATION NECK: On physical examination palpable lymphadenopathy is present in the left lower neck and supraclavicular area. No other cervical lymphadenopathy or supraclavicular lymphadenopathy is present. RESPIRATORY: Good air entry bilaterally. Examination of the chest wall reveals a small lesion where the chest wall recurrence was resected. No lumps, bumps or evidence of disease involving the right breast is present. ABDOMEN: Normal bowel sounds, no hepatomegaly. No tenderness on deep palpation. She has just started her last cycle of chemotherapy today, and she wishes to visit her daughter in Brooklyn, New York. After this she will return in approximately 3 to 4 weeks and begin her radiotherapy treatment at that time.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
```bash
|   | chunk                      | begin | end | entity | icdo_description                            | icdo_code |
|---|----------------------------|-------|-----|--------|---------------------------------------------|-----------|
| 0 | Left breast adenocarcinoma | 11 | 36 | Cancer | Intraductal carcinoma, noninfiltrating, NOS | 8500/2 |
| 1 | T3 N1b M0 | 44 | 52 | Cancer | Kaposi sarcoma | 9140/3 |
```
{:.model-param}
## Model Information
{:.table-model}
|----------------|----------------------------|
| Name: | chunkresolve_icdo_clinical |
| Type: | ChunkEntityResolverModel |
| Compatibility: | Spark NLP 2.4.2+ |
| License: | Licensed |
| Edition:       | Official                   |
|Input labels: | token, chunk_embeddings |
|Output labels: | entity |
| Language: | en |
| Case sensitive: | True |
| Dependencies: | embeddings_clinical |
{:.h2_title}
## Data Source
Trained on ICD-O Histology Behaviour dataset
https://apps.who.int/iris/bitstream/handle/10665/96612/9789241548496_eng.pdf
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from FOFer)
author: John Snow Labs
name: distilbert_qa_fofer_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `FOFer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_fofer_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768487186.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_fofer_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768487186.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_fofer_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_fofer_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_fofer_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/FOFer/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English BertForQuestionAnswering model (from aodiniz)
author: John Snow Labs
name: bert_qa_bert_uncased_L_10_H_512_A_8_squad2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-10_H-512_A-8_squad2` is an English model originally trained by `aodiniz`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_10_H_512_A_8_squad2_en_4.0.0_3.0_1654185195977.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_10_H_512_A_8_squad2_en_4.0.0_3.0_1654185195977.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_10_H_512_A_8_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_uncased_L_10_H_512_A_8_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.bert.uncased_10l_512d_a8a_512d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_uncased_L_10_H_512_A_8_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|178.3 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/aodiniz/bert_uncased_L-10_H-512_A-8_squad2
---
layout: model
title: Arabic Part of Speech Tagger (from CAMeL-Lab)
author: John Snow Labs
name: bert_pos_bert_base_arabic_camelbert_ca_pos_egy
date: 2022-04-26
tags: [bert, pos, part_of_speech, ar, open_source]
task: Part of Speech Tagging
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-ca-pos-egy` is an Arabic model originally trained by `CAMeL-Lab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_ca_pos_egy_ar_3.4.2_3.0_1650993368525.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_ca_pos_egy_ar_3.4.2_3.0_1650993368525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_ca_pos_egy","ar") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_ca_pos_egy","ar")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("أنا أحب الشرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.pos.arabic_camelbert_ca_pos_egy").predict("""أنا أحب الشرارة NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_pos_bert_base_arabic_camelbert_ca_pos_egy|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|ar|
|Size:|407.4 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-ca-pos-egy
- https://arxiv.org/abs/2103.06678
- https://github.com/CAMeL-Lab/CAMeLBERT
- https://github.com/CAMeL-Lab/camel_tools
---
layout: model
title: Legal Confidentiality Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_confidentiality_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, confidentiality, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract-provision (paragraph) classification. The provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a binary classifier (True, False) for the `Confidentiality` clause type. To use it, make sure you provide enough context as input. Adding sentence splitters to the pipeline makes the model see only sentences rather than the whole text, so it is better to skip them unless you want binary classification at sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings used by this model allow up to 512 tokens. If your input is longer, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in the Models Hub, yielding as output a series of True/False values for each of the clause models you have added.
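The paragraph-splitting-by-multiline step mentioned above can be approximated with a short regex before classification; a minimal standalone sketch (independent of the workshop notebook):

```python
import re

def split_paragraphs(text):
    # Split on one or more blank lines and drop empty fragments, so each
    # provision (paragraph) can be classified on its own.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("Section 1. Confidentiality.\nEach party shall keep the terms secret.\n\n"
       "Section 2. Term.\nThis Agreement remains in force for two years.")
for paragraph in split_paragraphs(doc):
    print(paragraph)
```

Each resulting paragraph can then be fed to the classifier as an independent document, keeping inputs within the 512-token limit.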
## Predicted Entities
`Confidentiality`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_confidentiality_bert_en_1.0.0_3.0_1678050626282.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_confidentiality_bert_en_1.0.0_3.0_1678050626282.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-----------------+
|result           |
+-----------------+
|[Confidentiality]|
|[Other]          |
|[Other]          |
|[Confidentiality]|
+-----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_confidentiality_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Confidentiality 0.94 0.99 0.96 123
Other 0.99 0.95 0.97 150
accuracy - - 0.97 273
macro-avg 0.97 0.97 0.97 273
weighted-avg 0.97 0.97 0.97 273
```
---
layout: model
title: Word Embeddings for Japanese (japanese_cc_300d)
author: John Snow Labs
name: japanese_cc_300d
date: 2021-09-09
tags: [embeddings, open_source, ja]
task: Embeddings
language: ja
edition: Spark NLP 3.2.2
spark_version: 3.0
supported: true
annotator: WordEmbeddingsModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained on Common Crawl and Wikipedia using fastText. It is trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives.
The model gives 300 dimensional vector outputs per token. The output vectors map words into a meaningful space where the distance between the vectors is related to semantic similarity of words.
These embeddings can be used in multiple tasks like semantic word similarity, named entity recognition, sentiment analysis, and classification.
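Semantic word similarity with such vectors is typically measured by cosine similarity; a self-contained sketch with made-up 3-d vectors (the real model emits 300-d vectors per token):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product normalized by both vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 3-d vectors for illustration only; actual embeddings are 300-d.
vec = {
    "任天堂": [0.9, 0.2, 0.1],   # Nintendo
    "ゲーム": [0.8, 0.3, 0.2],   # game
    "です":   [0.0, 0.1, 0.9],   # copula (function word)
}

print(cosine(vec["任天堂"], vec["ゲーム"]))  # semantically related: high
print(cosine(vec["任天堂"], vec["です"]))    # unrelated: low
```

With the real 300-d vectors, related words such as 任天堂 and ゲーム land closer together than content words and function words do.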
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/japanese_cc_300d_ja_3.2.2_3.0_1631192388744.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/japanese_cc_300d_ja_3.2.2_3.0_1631192388744.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("japanese_cc_300d", "ja") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
word_segmenter,
embeddings
])
data = spark.createDataFrame([["宮本茂氏は、日本の任天堂のゲームプロデューサーです。"]]).toDF("text")
model = pipeline.fit(data)
result = model.transform(data)
result.selectExpr("explode(arrays_zip(embeddings.result, embeddings.embeddings))").show()
```
```scala
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{SentenceDetector, WordSegmenterModel}
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja")
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("japanese_cc_300d", "ja")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
word_segmenter,
embeddings
))
val data = Seq("宮本茂氏は、日本の任天堂のゲームプロデューサーです。").toDF("text")
val model = pipeline.fit(data)
val result = model.transform(data)
result.selectExpr("explode(arrays_zip(embeddings.result, embeddings.embeddings))").show()
```
{:.nlu-block}
```python
import nlu
nlu.load("ja.embed.glove.cc_300d").predict("""宮本茂氏は、日本の任天堂のゲームプロデューサーです。""")
```
## Results
```bash
+---------------------------+
| col|
+---------------------------+
| [宮本, [0.1944, 0.4...|
| [茂, [-0.079, 0.09...|
| [氏, [-0.1053, 0.1...|
| [は, [0.0732, -0.0...|
| [、, [0.0571, -0.0...|
| [日本, [0.1844, 0.0...|
| [の, [0.0109, -0.0...|
| [任天, [0.0, 0.0, 0...|
| [堂, [-0.1972, 0.0...|
| [の, [0.0109, -0.0...|
| [ゲーム, [0.013, 0.0...|
|[プロデューサー, [-0.010...|
| [です, [0.0036, -0....|
| [。, [0.069, -0.01...|
+---------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|japanese_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.2.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|ja|
|Case sensitive:|false|
|Dimension:|300|
## Data Source
This model is imported from [https://fasttext.cc/docs/en/crawl-vectors.html](https://fasttext.cc/docs/en/crawl-vectors.html)
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_fpdm_triplet_ft_new_news
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_triplet_roberta_FT_new_newsqa` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_triplet_ft_new_news_en_4.3.0_3.0_1674211184082.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_triplet_ft_new_news_en_4.3.0_3.0_1674211184082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_triplet_ft_new_news","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_triplet_ft_new_news","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_fpdm_triplet_ft_new_news|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|461.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AnonymousSub/fpdm_triplet_roberta_FT_new_newsqa
---
layout: model
title: Legal Indemnifications Clause Binary Classifier
author: John Snow Labs
name: legclf_cuad_indemnifications_clause
date: 2022-09-27
tags: [cuad, indemnifications, en, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `indemnifications` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (see the tutorial linked above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
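The paragraph-splitting option above can be sketched in plain Python, independently of Spark NLP (the helper name is hypothetical and for illustration only):

```python
import re

def split_paragraphs(text):
    # Split on blank lines, i.e. multiline paragraph boundaries,
    # and drop empty fragments. (Hypothetical helper for illustration.)
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

contract = "1. Indemnification. The Seller shall indemnify the Buyer.\n\n2. Term. This Agreement runs for two years."
for clause in split_paragraphs(contract):
    print(clause)
```

Each resulting piece can then be fed to the classifier as a separate row of the input DataFrame.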
## Predicted Entities
`other`, `indemnifications`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_indemnifications_clause_en_1.0.0_3.0_1664272531526.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_indemnifications_clause_en_1.0.0_3.0_1664272531526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
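This card ships without a usage snippet; the sketch below follows the pattern of comparable Legal NLP clause classifiers and is an assumption, in particular the `UniversalSentenceEncoder` embedding stage and the `legal.ClassifierDLModel` loader (a licensed environment and the `johnsnowlabs` library are required):

```python
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence-level embeddings feeding the classifier (assumed stage).
embeddings = nlp.UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_cuad_indemnifications_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show()
```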
## Results
```bash
+------------------+
|            result|
+------------------+
|[indemnifications]|
|           [other]|
|           [other]|
|[indemnifications]|
+------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_cuad_indemnifications_clause|
|Type:|legal|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|21.9 MB|
## References
In-house annotations on CUAD dataset
## Benchmarking
```bash
label precision recall f1-score support
indemnifications 1.00 0.83 0.91 12
other 0.83 1.00 0.91 10
accuracy - - 0.91 22
macro avg 0.92 0.92 0.91 22
weighted avg 0.92 0.91 0.91 22
```
---
layout: model
title: English BertForMaskedLM Large Cased model
author: John Snow Labs
name: bert_embeddings_large_cased_whole_word_masking
date: 2022-12-02
tags: [en, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-cased-whole-word-masking` is an English model originally trained by HuggingFace.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_cased_whole_word_masking_en_4.2.4_3.0_1670020123161.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_cased_whole_word_masking_en_4.2.4_3.0_1670020123161.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_cased_whole_word_masking","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_cased_whole_word_masking","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_large_cased_whole_word_masking|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
## References
- https://huggingface.co/bert-large-cased-whole-word-masking
- https://arxiv.org/abs/1810.04805
- https://github.com/google-research/bert
- https://yknzhu.wixsite.com/mbweb
- https://en.wikipedia.org/wiki/English_Wikipedia
---
layout: model
title: English RobertaForQuestionAnswering (from comacrae)
author: John Snow Labs
name: roberta_qa_roberta_eda_and_parav3
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-eda-and-parav3` is an English model originally trained by `comacrae`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_eda_and_parav3_en_4.0.0_3.0_1655735762543.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_eda_and_parav3_en_4.0.0_3.0_1655735762543.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_eda_and_parav3","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_eda_and_parav3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.roberta.eda_and_parav3.by_comacrae").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_eda_and_parav3|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/comacrae/roberta-eda-and-parav3
---
layout: model
title: Part of Speech for Bulgarian
author: John Snow Labs
name: pos_ud_btb
date: 2021-03-08
tags: [part_of_speech, open_source, bulgarian, pos_ud_btb, bg]
task: Part of Speech Tagging
language: bg
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`.
## Predicted Entities
- ADP
- NOUN
- PUNCT
- VERB
- AUX
- PRON
- ADJ
- PART
- ADV
- INTJ
- DET
- PROPN
- CCONJ
- NUM
- SCONJ
- X
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_btb_bg_3.0.0_3.0_1615230275121.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_btb_bg_3.0.0_3.0_1615230275121.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ud_btb", "bg") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['Здравейте от Lak Snow Labs! ']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_btb", "bg")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("Здравейте от Lak Snow Labs! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["Здравейте от Lak Snow Labs! "]
token_df = nlu.load('bg.pos.ud_btb').predict(text)
token_df
```
## Results
```bash
token pos
0 Здравейте VERB
1 от ADP
2 Lak ADJ
3 Snow PROPN
4 Labs PROPN
5 ! PUNCT
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_btb|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|bg|
---
layout: model
title: Pipeline for Detect Medication
author: John Snow Labs
name: ner_medication_pipeline
date: 2022-07-28
tags: [ner, en, licensed]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A pretrained pipeline to detect medication entities. It was built on top of the `ner_posology_greedy` model and augmented with the drug names mentioned in the UK and US DrugBank datasets.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_medication_pipeline_en_4.0.0_3.0_1658987434372.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_medication_pipeline_en_4.0.0_3.0_1658987434372.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
ner_medication_pipeline = PretrainedPipeline("ner_medication_pipeline", "en", "clinical/models")
text = """The patient was prescribed metformin 1000 MG, and glipizide 2.5 MG. The other patient was given Fragmin 5000 units, Xenaderm to wounds topically b.i.d. and OxyContin 30 mg."""
result = ner_medication_pipeline.fullAnnotate([text])
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val ner_medication_pipeline = new PretrainedPipeline("ner_medication_pipeline", "en", "clinical/models")
val result = ner_medication_pipeline.fullAnnotate("The patient was prescribed metformin 1000 MG, and glipizide 2.5 MG. The other patient was given Fragmin 5000 units, Xenaderm to wounds topically b.i.d. and OxyContin 30 mg.")
```
## Results
```bash
| ner_chunk | entity |
|:-------------------|:---------|
| metformin 1000 MG | DRUG |
| glipizide 2.5 MG | DRUG |
| Fragmin 5000 units | DRUG |
| Xenaderm | DRUG |
| OxyContin 30 mg | DRUG |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_medication_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
- TextMatcherModel
- ChunkMergeModel
- Finisher
---
layout: model
title: ALBERT Embeddings (Large Uncased)
author: John Snow Labs
name: albert_large_uncased
date: 2020-04-28
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [embeddings, en, open_source]
supported: true
annotator: AlBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
ALBERT is "A Lite" version of BERT, a popular unsupervised language representation learning algorithm. ALBERT uses parameter-reduction techniques that allow for large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation. The details are described in the paper "[ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.](https://arxiv.org/abs/1909.11942)"
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_large_uncased_en_2.5.0_2.4_1588073397355.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_large_uncased_en_2.5.0_2.4_1588073397355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = AlbertEmbeddings.pretrained("albert_large_uncased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = AlbertEmbeddings.pretrained("albert_large_uncased", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.albert.large_uncased').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_albert_large_uncased_embeddings
I [0.3967159688472748, -0.6448764801025391, -0.3...
love [1.1107065677642822, -0.2454298734664917, 0.60...
NLP [0.02937467396259308, -0.7092287540435791, -0....
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_large_uncased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.5.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|1024|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from [https://tfhub.dev/google/albert_large/3](https://tfhub.dev/google/albert_large/3)
---
layout: model
title: Stop Words Cleaner for Russian
author: John Snow Labs
name: stopwords_ru
date: 2020-07-14 19:03:00 +0800
task: Stop Words Removal
language: ru
edition: Spark NLP 2.5.4
spark_version: 2.4
tags: [stopwords, ru]
supported: true
annotator: StopWordsCleaner
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_ru_ru_2.5.4_2.4_1594742439248.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_ru_ru_2.5.4_2.4_1594742439248.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
stop_words = StopWordsCleaner.pretrained("stopwords_ru", "ru") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Помимо того, что он король севера, Джон Сноу - английский врач и лидер в разработке анестезии и медицинской гигиены.")
```
```scala
...
val stopWords = StopWordsCleaner.pretrained("stopwords_ru", "ru")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords))
val data = Seq("Помимо того, что он король севера, Джон Сноу - английский врач и лидер в разработке анестезии и медицинской гигиены.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Помимо того, что он король севера, Джон Сноу - английский врач и лидер в разработке анестезии и медицинской гигиены."""]
stopword_df = nlu.load('ru.stopwords').predict(text)
stopword_df[['cleanTokens']]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=5, result='Помимо', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=11, end=11, result=',', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=20, end=25, result='король', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=27, end=32, result='севера', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=33, end=33, result=',', metadata={'sentence': '0'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_ru|
|Type:|stopwords|
|Compatibility:|Spark NLP 2.5.4+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|ru|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)
---
layout: model
title: Arabic Bert Embeddings (Large)
author: John Snow Labs
name: bert_embeddings_bert_large_arabic
date: 2022-04-11
tags: [bert, embeddings, ar, open_source]
task: Embeddings
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-arabic` is an Arabic model originally trained by `asafaya`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_arabic_ar_3.4.2_3.0_1649678414101.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_arabic_ar_3.4.2_3.0_1649678414101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_arabic","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_arabic","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("أنا أحب شرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.embed.bert_large_arabic").predict("""أنا أحب شرارة NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_large_arabic|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|1.3 GB|
|Case sensitive:|true|
## References
- https://huggingface.co/asafaya/bert-large-arabic
- https://traces1.inria.fr/oscar/
- http://commoncrawl.org/
- https://dumps.wikimedia.org/backup-index.html
- https://github.com/google-research/bert
- https://www.tensorflow.org/tfrc
- https://github.com/alisafaya/Arabic-BERT
---
layout: model
title: Hindi BertForQuestionAnswering model (from Sindhu)
author: John Snow Labs
name: bert_qa_muril_large_squad2
date: 2022-06-02
tags: [open_source, question_answering, bert]
task: Question Answering
language: hi
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `muril-large-squad2` is a Hindi model originally trained by `Sindhu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_muril_large_squad2_hi_4.0.0_3.0_1654188807056.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_muril_large_squad2_hi_4.0.0_3.0_1654188807056.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_muril_large_squad2","hi") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_muril_large_squad2","hi")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("hi.answer_question.squadv2.bert.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_muril_large_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|hi|
|Size:|1.9 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Sindhu/muril-large-squad2
- https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
- https://twitter.com/batw0man
---
layout: model
title: Google's Tapas Table Understanding (Medium, WIKISQL)
author: John Snow Labs
name: table_qa_tapas_medium_finetuned_wikisql_supervised
date: 2022-09-30
tags: [en, table, qa, question, answering, open_source]
task: Table Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: TapasForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a Zero-shot Table Understanding model which allows you to carry out Question Answering on Spark DataFrames. If your table is stored in a file format such as CSV, load it into Spark before using the model.
Size of this model: Medium
Has aggregation operations?: True
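If your table starts out as a CSV file, it can be converted to the `{"header": [...], "rows": [...]}` JSON layout used in the example below with a short, Spark-independent sketch (the helper name is hypothetical):

```python
import csv
import io
import json

def csv_to_table_json(csv_text):
    # The first CSV row becomes "header", the remaining rows become "rows",
    # matching the JSON layout expected by the TableAssembler in the pipeline.
    rows = list(csv.reader(io.StringIO(csv_text)))
    return json.dumps({"header": rows[0], "rows": rows[1:]})

csv_text = 'name,money,age\nDonald Trump,"$100,000,000",75\nElon Musk,"$20,000,000,000,000",55'
json_data = csv_to_table_json(csv_text)
print(json_data)
```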
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_medium_finetuned_wikisql_supervised_en_4.2.0_3.0_1664530746170.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_medium_finetuned_wikisql_supervised_en_4.2.0_3.0_1664530746170.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
json_data = """
{
"header": ["name", "money", "age"],
"rows": [
["Donald Trump", "$100,000,000", "75"],
["Elon Musk", "$20,000,000,000,000", "55"]
]
}
"""
queries = [
"Who earns less than 200,000,000?",
"Who earns 100,000,000?",
"How much money has Donald Trump?",
"How old are they?",
]
data = spark.createDataFrame([
[json_data, " ".join(queries)]
]).toDF("table_json", "questions")
document_assembler = MultiDocumentAssembler() \
.setInputCols("table_json", "questions") \
.setOutputCols("document_table", "document_questions")
sentence_detector = SentenceDetector() \
.setInputCols(["document_questions"]) \
.setOutputCol("questions")
table_assembler = TableAssembler()\
.setInputCols(["document_table"])\
.setOutputCol("table")
tapas = TapasForQuestionAnswering\
.pretrained("table_qa_tapas_medium_finetuned_wikisql_supervised","en")\
.setInputCols(["questions", "table"])\
.setOutputCol("answers")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
table_assembler,
tapas
])
model = pipeline.fit(data)
model\
.transform(data)\
.selectExpr("explode(answers) AS answer")\
.select("answer")\
.show(truncate=False)
```
## Results
```bash
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|answer |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} |
|{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} |
|{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} |
|{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|table_qa_tapas_medium_finetuned_wikisql_supervised|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|157.5 MB|
|Case sensitive:|false|
## References
https://www.microsoft.com/en-us/download/details.aspx?id=54253
https://github.com/ppasupat/WikiTableQuestions
https://github.com/salesforce/WikiSQL
---
layout: model
title: English image_classifier_vit_autotrain_dog_vs_food ViTForImageClassification from abhishek
author: John Snow Labs
name: image_classifier_vit_autotrain_dog_vs_food
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_autotrain_dog_vs_food` is an English model originally trained by abhishek.
## Predicted Entities
`dog`, `food`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_autotrain_dog_vs_food_en_4.1.0_3.0_1660171758307.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_autotrain_dog_vs_food_en_4.1.0_3.0_1660171758307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_autotrain_dog_vs_food", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_autotrain_dog_vs_food", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_autotrain_dog_vs_food|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from Moussab)
author: John Snow Labs
name: roberta_qa_deepset_base_squad2_orkg_no_label_5e_05
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deepset-roberta-base-squad2-orkg-no-label-5e-05` is an English model originally trained by `Moussab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_no_label_5e_05_en_4.3.0_3.0_1674209664953.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_no_label_5e_05_en_4.3.0_3.0_1674209664953.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_no_label_5e_05","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_no_label_5e_05","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_deepset_base_squad2_orkg_no_label_5e_05|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Moussab/deepset-roberta-base-squad2-orkg-no-label-5e-05
---
layout: model
title: English BertForQuestionAnswering model (from HankyStyle)
author: John Snow Labs
name: bert_qa_Multi_ling_BERT
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Multi-ling-BERT` is an English model originally trained by `HankyStyle`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Multi_ling_BERT_en_4.0.0_3.0_1654178859506.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Multi_ling_BERT_en_4.0.0_3.0_1654178859506.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Multi_ling_BERT","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_Multi_ling_BERT","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bert.by_HankyStyle").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_Multi_ling_BERT|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|626.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/HankyStyle/Multi-ling-BERT
---
layout: model
title: Word2Vec Embeddings in Swahili (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, sw, open_source]
task: Embeddings
language: sw
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
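Conceptually, the lookup works like a dictionary from tokens to fixed-size vectors, with a zero vector for out-of-vocabulary tokens. A minimal sketch of this behavior — the vector values and dimension here are made up for illustration (the real model is 300-dimensional and, per the table below, not case sensitive):

```python
# Illustrative sketch of a word-embeddings lookup annotator, not the
# Spark NLP implementation.
DIM = 3  # hypothetical; the real model uses 300 dimensions

vectors = {
    "ninapenda": [0.1, 0.2, 0.3],  # hypothetical values
    "spark":     [0.4, 0.5, 0.6],
}

def embed(tokens, table, dim):
    """Map each token to its stored vector; unknown tokens get a zero vector."""
    zero = [0.0] * dim
    # The model is case insensitive, so tokens are lowercased before lookup.
    return [table.get(tok.lower(), zero) for tok in tokens]

emb = embed(["Ninapenda", "Spark", "NLP"], vectors, DIM)
# "NLP" is out of vocabulary, so its vector is all zeros.
```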
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sw_3.4.1_3.0_1647459595377.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sw_3.4.1_3.0_1647459595377.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sw") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ninapenda Spark NLP."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sw")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ninapenda Spark NLP.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("sw.embed.w2v_cc_300d").predict("""Ninapenda Spark NLP.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|sw|
|Size:|224.0 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: English DistilBertForQuestionAnswering model (from threem) Squad2
author: John Snow Labs
name: distilbert_qa_mysquadv2_8Jan22_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mysquadv2_8Jan22-finetuned-squad` is an English model originally trained by `threem`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_mysquadv2_8Jan22_finetuned_squad_en_4.0.0_3.0_1654728468416.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_mysquadv2_8Jan22_finetuned_squad_en_4.0.0_3.0_1654728468416.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mysquadv2_8Jan22_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mysquadv2_8Jan22_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.distil_bert.v2.by_threem").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_mysquadv2_8Jan22_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/threem/mysquadv2_8Jan22-finetuned-squad
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from ArneD)
author: John Snow Labs
name: xlmroberta_ner_arned_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `ArneD`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_arned_base_finetuned_panx_de_4.1.0_3.0_1660429307551.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_arned_base_finetuned_panx_de_4.1.0_3.0_1660429307551.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_arned_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_arned_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_arned_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/ArneD/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_by_anan0329 TFWav2Vec2ForCTC from anan0329
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab_by_anan0329
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_anan0329` is an English model originally trained by anan0329.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_anan0329_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_anan0329_en_4.2.0_3.0_1664114661625.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_anan0329_en_4.2.0_3.0_1664114661625.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_timit_demo_colab_by_anan0329", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_timit_demo_colab_by_anan0329", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab_by_anan0329|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|
---
layout: model
title: NER Pipeline for Tests - Voice of the Patient
author: John Snow Labs
name: ner_vop_test_pipeline
date: 2023-06-10
tags: [licensed, pipeline, ner, en, vop, test]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline extracts mentions of tests and their results from health-related text in colloquial language.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_pipeline_en_4.4.3_3.0_1686427000395.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_pipeline_en_4.4.3_3.0_1686427000395.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_vop_test_pipeline", "en", "clinical/models")
pipeline.annotate("""I went to the endocrinology department to get my thyroid levels checked. They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it.""")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_vop_test_pipeline", "en", "clinical/models")
val result = pipeline.annotate("""I went to the endocrinology department to get my thyroid levels checked. They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it.""")
```
## Results
```bash
| chunk | ner_label |
|:---------------|:------------|
| thyroid levels | Test |
| blood test | Test |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_test_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|791.6 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: Legal Representations Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_representations_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, representations, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Representations` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline makes the model see only individual sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
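The first technique above, paragraph splitting by multiline, can be sketched in a few lines of plain Python (this is a simplified illustration, not the workshop's implementation):

```python
import re

# Minimal sketch of "paragraph splitting (by multiline)": break a long
# document on blank lines so each chunk is a candidate clause/provision.
def split_paragraphs(text):
    # Two or more consecutive newlines delimit paragraphs.
    parts = re.split(r"\n{2,}", text)
    return [p.strip() for p in parts if p.strip()]

doc = "First clause text.\n\nSecond clause text.\n\n\nThird clause text."
print(split_paragraphs(doc))
# ['First clause text.', 'Second clause text.', 'Third clause text.']
```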
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the other hundreds of legal clause classifiers available in Models Hub, producing as output a series of True/False values, one for each legal clause model you add.
## Predicted Entities
`Representations`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_representations_bert_en_1.0.0_3.0_1678050021027.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_representations_bert_en_1.0.0_3.0_1678050021027.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
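{% include programmingLanguageSelectScalaPythonNLU.html %}
This card omits a usage snippet; a minimal sketch following the other classifier cards on this page is given below. The sentence-embeddings stage (`sent_bert_base_cased`) is an assumption — this card only states that the classifier consumes `sentence_embeddings` — so check the Models Hub entry for the exact stages this classifier was trained with.

```python
# Sketch only: the embeddings stage below is an assumption; this card does not
# state which sentence-embeddings model the classifier was trained with.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = ClassifierDLModel.pretrained("legclf_representations_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

data = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("class.result").show(truncate=False)
```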
## Results
```bash
+-----------------+
|result           |
+-----------------+
|[Representations]|
|[Other]          |
|[Other]          |
|[Representations]|
+-----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_representations_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 0.92 0.91 0.92 93
Representations 0.88 0.89 0.89 66
accuracy - - 0.91 159
macro-avg 0.90 0.90 0.90 159
weighted-avg 0.91 0.91 0.91 159
```
---
layout: model
title: RxNorm to MeSH Code Mapping
author: John Snow Labs
name: rxnorm_mesh_mapping
date: 2023-06-13
tags: [rxnorm, mesh, en, licensed, pipeline]
task: Pipeline Healthcare
language: en
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline maps RxNorm codes to MeSH codes without using any text data. You just feed it a whitespace-delimited string of RxNorm codes and it returns the corresponding MeSH codes as a list. If a code has no mapping, the original code is returned unchanged.
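The mapping behavior can be illustrated with a toy dictionary lookup. The code pairs below come from the Results section of this card; this is an illustration of the input/output contract, not the pipeline's internals:

```python
# Toy illustration of RxNorm-to-MeSH mapping with pass-through for
# unmapped codes (pairs taken from the Results section of this card).
RXNORM_TO_MESH = {
    "1191": "D001241",   # aspirin
    "6809": "D008687",   # metformin
    "47613": "D019355",  # calcium citrate
}

def map_codes(codes_text):
    """Map a whitespace-delimited string of RxNorm codes to MeSH codes.
    Codes with no mapping are returned unchanged."""
    return [RXNORM_TO_MESH.get(code, code) for code in codes_text.split()]

print(map_codes("1191 6809 47613 99999"))
# ['D001241', 'D008687', 'D019355', '99999']
```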
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_mesh_mapping_en_4.4.4_3.2_1686663529810.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_mesh_mapping_en_4.4.4_3.2_1686663529810.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("rxnorm_mesh_mapping","en","clinical/models")
pipeline.annotate("1191 6809 47613")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("rxnorm_mesh_mapping","en","clinical/models")
val result = pipeline.annotate("1191 6809 47613")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.rxnorm.mesh").predict("""1191 6809 47613""")
```
## Results
```bash
Results
{'rxnorm': ['1191', '6809', '47613'],
'mesh': ['D001241', 'D008687', 'D019355']}
Note:
| RxNorm | Details |
| ---------- | -------------------:|
| 1191 | aspirin |
| 6809 | metformin |
| 47613 | calcium citrate |
| MeSH | Details |
| ---------- | -------------------:|
| D001241 | Aspirin |
| D008687 | Metformin |
| D019355 | Calcium Citrate |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|rxnorm_mesh_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|103.6 KB|
## Included Models
- DocumentAssembler
- TokenizerModel
- LemmatizerModel
- Finisher
---
layout: model
title: T5 for Active to Passive Style Transfer
author: John Snow Labs
name: t5_active_to_passive_styletransfer
date: 2022-01-12
tags: [t5, open_source, en]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 3.4.0
spark_version: 3.0
supported: true
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a text-to-text model based on T5 fine-tuned to generate passively written text from an actively written input, for the task "transfer Active to Passive:". It is based on Prithiviraj Damodaran's Styleformer.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/T5_LINGUISTIC/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5_LINGUISTIC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_active_to_passive_styletransfer_en_3.4.0_3.0_1641987400533.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_active_to_passive_styletransfer_en_3.4.0_3.0_1641987400533.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
spark = sparknlp.start()
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("documents")
t5 = T5Transformer.pretrained("t5_active_to_passive_styletransfer") \
.setTask("transfer Active to Passive:") \
.setInputCols(["documents"]) \
.setMaxOutputLength(200) \
.setOutputCol("transfers")
pipeline = Pipeline().setStages([documentAssembler, t5])
data = spark.createDataFrame([["I am writing you a letter."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("transfers.result").show(truncate=False)
```
```scala
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.T5Transformer
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("documents")
val t5 = T5Transformer.pretrained("t5_active_to_passive_styletransfer")
.setTask("transfer Active to Passive:")
.setMaxOutputLength(200)
.setInputCols("documents")
.setOutputCol("transfer")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("I am writing you a letter.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("transfer.result").show(false)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.t5.active_to_passive_styletransfer").predict("""transfer Active to Passive:""")
```
## Results
```bash
+---------------------------+
|result |
+---------------------------+
|[a letter is written by me]|
+---------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_active_to_passive_styletransfer|
|Compatibility:|Spark NLP 3.4.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[transfers]|
|Language:|en|
|Size:|264.5 MB|
## Data Source
The original model is from the transformers library:
https://huggingface.co/prithivida/active_to_passive_styletransfer
---
layout: model
title: Luo (Kenya and Tanzania) XLMRobertaForTokenClassification Base Cased model (from mbeukman)
author: John Snow Labs
name: xlmroberta_ner_base_finetuned_ner
date: 2022-08-01
tags: [luo, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: luo
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-ner-luo` is a Luo (Kenya and Tanzania) model originally trained by `mbeukman`.
## Predicted Entities
`DATE`, `PER`, `ORG`, `LOC`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_ner_luo_4.1.0_3.0_1659355137886.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_ner_luo_4.1.0_3.0_1659355137886.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_ner","luo") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_ner","luo")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_finetuned_ner|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|luo|
|Size:|772.6 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-ner-luo
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://github.com/masakhane-io/masakhane-ner
---
layout: model
title: Pipeline to Detect Anatomical and Observation Entities in Chest Radiology Reports (CheXpert)
author: John Snow Labs
name: ner_chexpert_pipeline
date: 2023-03-14
tags: [licensed, ner, clinical, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_chexpert](https://nlp.johnsnowlabs.com/2021/09/30/ner_chexpert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chexpert_pipeline_en_4.3.0_3.2_1678779791404.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chexpert_pipeline_en_4.3.0_3.2_1678779791404.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_chexpert_pipeline", "en", "clinical/models")
text = '''FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax. FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_chexpert_pipeline", "en", "clinical/models")
val text = "FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax. FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.chexpert.pipeline").predict("""FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax. FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_large_arabic","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_large_arabic","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("أنا أحب شرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.embed.albert_large_arabic").predict("""أنا أحب شرارة NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_embeddings_albert_large_arabic|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|68.0 MB|
|Case sensitive:|false|
## References
- https://huggingface.co/asafaya/albert-large-arabic
- https://oscar-corpus.com/
- http://commoncrawl.org/
- https://dumps.wikimedia.org/backup-index.html
- https://github.com/google-research/albert
- https://www.tensorflow.org/tfrc
- https://github.com/KUIS-AI-Lab/Arabic-ALBERT/
---
layout: model
title: Legal Release Clause Binary Classifier
author: John Snow Labs
name: legclf_release_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `release` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
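The first option, paragraph splitting by multiline, can be sketched in plain Python (a hypothetical helper, not part of Spark NLP; it splits on blank lines):

```python
def split_paragraphs(text: str) -> list[str]:
    """Split a document into paragraphs on blank lines (multiline split)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

doc = "RELEASE. Each party releases the other...\n\nNOTICES. All notices shall be..."
paragraphs = split_paragraphs(doc)
# Each paragraph can then be fed to the classifier as a separate row.
```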
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `release`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_release_clause_en_1.0.0_3.2_1660122916319.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_release_clause_en_1.0.0_3.2_1660122916319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
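The usage snippet is missing from this card. The following is a pseudocode-level sketch based on the pattern of sibling `legclf` cards: the embeddings stage is an assumption (check the Models Hub entry for the exact sentence embeddings this classifier was trained with), and running it requires a Legal NLP license.

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed embeddings stage: the model consumes sentence_embeddings (see Input Labels below).
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_release_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```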
## Results
```bash
+---------+
|result   |
+---------+
|[release]|
|[other]  |
|[other]  |
|[release]|
+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_release_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.88 1.00 0.93 70
release 1.00 0.50 0.67 20
accuracy - - 0.89 90
macro-avg 0.94 0.75 0.80 90
weighted-avg 0.90 0.89 0.87 90
```
---
layout: model
title: Toxic Comment Classification - Small
author: John Snow Labs
name: multiclassifierdl_use_toxic_sm
date: 2021-01-21
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 2.7.1
spark_version: 2.4
tags: [open_source, en, text_classification]
supported: true
annotator: MultiClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.
The Conversation AI team, a research initiative founded by Jigsaw and Google (both a part of Alphabet) is working on tools to help improve the online conversation. One area of focus is the study of negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful, or otherwise likely to make someone leave a discussion). So far they’ve built a range of publicly available models served through the Perspective API, including toxicity. But the current models still make errors, and they don’t allow users to select which types of toxicity they’re interested in finding (e.g. some platforms may be fine with profanity, but not with other types of toxic content).
Automatically detect identity hate, insult, obscene, severe toxic, threat, or toxic content in social media comments using our out-of-the-box Spark NLP MultiClassifierDL.
Records without any labels were removed before training this model (only 14K+ comments were used to train it).
## Predicted Entities
`toxic`, `severe_toxic`, `identity_hate`, `insult`, `obscene`, `threat`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_MULTILABEL_TOXIC/){:.button.button-orange}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_MULTILABEL_TOXIC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/multiclassifierdl_use_toxic_sm_en_2.7.1_2.4_1611230645484.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/multiclassifierdl_use_toxic_sm_en_2.7.1_2.4_1611230645484.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
use = UniversalSentenceEncoder.pretrained() \
.setInputCols(["document"])\
.setOutputCol("use_embeddings")
docClassifier = MultiClassifierDLModel.pretrained("multiclassifierdl_use_toxic_sm") \
.setInputCols(["use_embeddings"])\
.setOutputCol("category")\
.setThreshold(0.5)
pipeline = Pipeline(
stages = [
document,
use,
docClassifier
])
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
.setCleanupMode("shrink")
val use = UniversalSentenceEncoder.pretrained()
.setInputCols("document")
.setOutputCol("use_embeddings")
val docClassifier = MultiClassifierDLModel.pretrained("multiclassifierdl_use_toxic_sm")
.setInputCols("use_embeddings")
.setOutputCol("category")
.setThreshold(0.5f)
val pipeline = new Pipeline()
.setStages(
Array(
documentAssembler,
use,
docClassifier
)
)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.toxic.sm").predict("""Put your text here.""")
```
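The `setThreshold(0.5)` call above turns the per-label sigmoid scores into a multi-label prediction by applying an independent cutoff to each label. Conceptually, in plain Python (hypothetical scores, not the Spark NLP implementation):

```python
def predict_labels(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Keep every label whose independent sigmoid score clears the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)

scores = {"toxic": 0.91, "insult": 0.62, "obscene": 0.48, "threat": 0.03}
print(predict_labels(scores))  # ['insult', 'toxic']
```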
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|multiclassifierdl_use_toxic_sm|
|Compatibility:|Spark NLP 2.7.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[use_embeddings]|
|Output Labels:|[category]|
|Language:|en|
## Data Source
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview
## Benchmarking
```bash
Classification report:
precision recall f1-score support
0 0.56 0.30 0.39 127
1 0.71 0.70 0.70 761
2 0.76 0.72 0.74 824
3 0.55 0.21 0.31 147
4 0.79 0.38 0.51 50
5 0.94 1.00 0.97 1504
micro avg 0.83 0.80 0.81 3413
macro avg 0.72 0.55 0.60 3413
weighted avg 0.81 0.80 0.80 3413
samples avg 0.84 0.83 0.80 3413
```
---
layout: model
title: Detect PHI for Generic Deidentification in Romanian (BERT)
author: John Snow Labs
name: ner_deid_generic_bert
date: 2022-11-22
tags: [licensed, clinical, ro, deidentification, phi, generic, bert]
task: Named Entity Recognition
language: ro
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Named Entity Recognition model trained with a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Deidentification NER (Romanian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It is trained with bert_base_cased embeddings and can detect 7 generic entities.
This NER model is trained with a combination of custom datasets with several data augmentation mechanisms.
## Predicted Entities
`AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_ro_4.2.2_3.0_1669122326582.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_ro_4.2.2_3.0_1669122326582.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\
.setInputCols(["sentence","token"])\
.setOutputCol("word_embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models")\
.setInputCols(["sentence","token","word_embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner,
ner_converter])
text = """
Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""
data = spark.createDataFrame([[text]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")
.setInputCols(Array("sentence","token"))
.setOutputCol("word_embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models")
.setInputCols(Array("sentence","token","word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner,
ner_converter))
val text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""
val data = Seq(text).toDS.toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ro.med_ner.deid_generic_bert").predict("""
Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401""")
```
## Results
```bash
+----------------------------+---------+
|chunk |ner_label|
+----------------------------+---------+
|Spitalul Pentru Ochi de Deal|LOCATION |
|Drumul Oprea Nr |LOCATION |
|972 |LOCATION |
|Vaslui |LOCATION |
|737405 |LOCATION |
|+40(235)413773 |CONTACT |
|25 May 2022 |DATE |
|BUREAN MARIA |NAME |
|77 |AGE |
|Agota Evelyn Tımar |NAME |
|2450502264401 |ID |
+----------------------------+---------+
```
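The `NerConverter` stage merges the model's token-level IOB tags into the chunks shown above. A minimal plain-Python illustration of that merge (not the Spark NLP implementation):

```python
def iob_to_chunks(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Merge IOB-tagged tokens into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" tag, or an I- tag that does not continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["BUREAN", "MARIA", ",", "Varsta", ":", "77"]
tags = ["B-NAME", "I-NAME", "O", "O", "O", "B-AGE"]
print(iob_to_chunks(tokens, tags))  # [('BUREAN MARIA', 'NAME'), ('77', 'AGE')]
```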
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_generic_bert|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ro|
|Size:|16.5 MB|
## References
- Custom John Snow Labs datasets
- Data augmentation techniques
## Benchmarking
```bash
label precision recall f1-score support
AGE 0.95 0.97 0.96 1186
CONTACT 0.99 0.98 0.98 366
DATE 0.96 0.92 0.94 4518
ID 1.00 1.00 1.00 679
LOCATION 0.91 0.90 0.90 1683
NAME 0.93 0.96 0.94 2916
PROFESSION 0.87 0.85 0.86 161
micro-avg 0.94 0.94 0.94 11509
macro-avg 0.94 0.94 0.94 11509
weighted-avg 0.95 0.94 0.94 11509
```
---
layout: model
title: Modern Greek (1453-) asr_wav2vec2_large_xlsr_greek_2 TFWav2Vec2ForCTC from skylord
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_greek_2
date: 2022-09-25
tags: [wav2vec2, el, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: el
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_greek_2` is a Modern Greek (1453-) model originally trained by skylord.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_greek_2_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_greek_2_el_4.2.0_3.0_1664112085138.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_greek_2_el_4.2.0_3.0_1664112085138.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_greek_2', lang = 'el')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_greek_2", lang = "el")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_greek_2|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|el|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from huxxx657)
author: John Snow Labs
name: roberta_qa_base_finetuned_deletion_squad_10
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-deletion-squad-10` is an English model originally trained by `huxxx657`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_deletion_squad_10_en_4.3.0_3.0_1674216541297.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_deletion_squad_10_en_4.3.0_3.0_1674216541297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_deletion_squad_10","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_deletion_squad_10","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_demo", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_demo", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_demo|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|324.8 MB|
---
layout: model
title: Word2Vec Embeddings in Lombard (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, lmo, open_source]
task: Embeddings
language: lmo
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
annotator: WordEmbeddingsModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_lmo_3.4.1_3.0_1647443293397.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_lmo_3.4.1_3.0_1647443293397.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","lmo") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","lmo")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("lmo.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|lmo|
|Size:|297.3 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Detect Diseases
author: John Snow Labs
name: ner_diseases_en
date: 2020-03-25
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 2.4.4
spark_version: 2.4
tags: [ner, en, licensed, clinical]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Pretrained named entity recognition deep learning model for diseases. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
{:.h2_title}
## Predicted Entities
`Disease`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DIAG_PROC/){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diseases_en_2.4.4_2.4_1584452534235.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diseases_en_2.4.4_2.4_1584452534235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %}
```python
...
embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_diseases", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([['Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive. ']], ["text"]))
```
```scala
...
val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = NerDLModel.pretrained("ner_diseases", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))
val data = Seq("Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive. ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe, or add the ``"Finisher"`` to the end of your pipeline.
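For example, the token/label pairs can be pulled out of the transformed dataframe with a `selectExpr` along these lines (a sketch; `results` is the dataframe produced by the pipeline above, and a running Spark session is assumed):

```python
# Zip the token and NER label arrays, then explode them into one row per token.
results.selectExpr("explode(arrays_zip(token.result, ner.result)) as cols") \
    .selectExpr("cols['0'] as token", "cols['1'] as ner_label") \
    .show(truncate=False)
```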
```bash
+------------------------------+---------+
|chunk |ner |
+------------------------------+---------+
|the cyst |Disease |
|a large Prolene suture |Disease |
|a very small incisional hernia|Disease |
|the hernia cavity |Disease |
|omentum |Disease |
|the hernia |Disease |
|the wound lesion |Disease |
|The lesion |Disease |
|the existing scar |Disease |
|the cyst |Disease |
|the wound |Disease |
|this cyst down to its base |Disease |
|a small incisional hernia |Disease |
|The cyst |Disease |
|The wound |Disease |
+------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_diseases|
|Type:|ner|
|Compatibility:|Spark NLP 2.4.4+|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Case sensitive:|false|
{:.h2_title}
## Data Source
Trained on i2b2 with ``embeddings_clinical``.
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
{:.h2_title}
## Benchmarking
```bash
| | label | tp | fp | fn | prec | rec | f1 |
|---:|:--------------|-------:|-----:|-----:|---------:|---------:|---------:|
| 0 | I-Disease | 5014 | 222 | 171 | 0.957601 | 0.96702 | 0.962288 |
| 1 | B-Disease | 6004 | 213 | 159 | 0.965739 | 0.974201 | 0.969952 |
| 2 | Macro-average | 11018 | 435 | 330 | 0.96167 | 0.970611 | 0.96612 |
| 3 | Micro-average | 11018 | 435 | 330 | 0.962019 | 0.97092 | 0.966449 |
```
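The precision, recall, and F1 columns above follow directly from the tp/fp/fn counts. As a sanity check, the micro-average row can be reproduced in a few lines of plain Python (the helper below is illustrative, not part of Spark NLP):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Micro-average row of the table: tp=11018, fp=435, fn=330
p, r, f1 = precision_recall_f1(11018, 435, 330)
print(round(p, 6), round(r, 5), round(f1, 6))  # 0.962019 0.97092 0.966449
```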
---
layout: model
title: English BertForQuestionAnswering model (from deepset)
author: John Snow Labs
name: bert_qa_deepset_bert_base_uncased_squad2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad2` is an English model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_deepset_bert_base_uncased_squad2_en_4.0.0_3.0_1654181480200.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_deepset_bert_base_uncased_squad2_en_4.0.0_3.0_1654181480200.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_deepset_bert_base_uncased_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_deepset_bert_base_uncased_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.bert.base_uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_deepset_bert_base_uncased_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/deepset/bert-base-uncased-squad2
- https://github.com/deepset-ai/haystack/discussions
- https://deepset.ai
- https://twitter.com/deepset_ai
- http://www.deepset.ai/jobs
- https://haystack.deepset.ai/community/join
- https://github.com/deepset-ai/haystack/
- https://deepset.ai/german-bert
- https://www.linkedin.com/company/deepset-ai/
- https://github.com/deepset-ai/FARM
- https://deepset.ai/germanquad
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_8
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-256-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_8_en_4.0.0_3.0_1657184863712.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_8_en_4.0.0_3.0_1657184863712.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_8","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_8","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_8|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-256-finetuned-squad-seed-8
---
layout: model
title: Legal Forbearance Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_forbearance_agreement_bert
date: 2023-02-02
tags: [en, legal, classification, forbearance, agreement, licensed, bert, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_forbearance_agreement_bert` model is a Bert Sentence Embeddings document classifier used to determine whether a document belongs to the class `forbearance-agreement` or not (binary classification).
Compared with the Longformer-based alternative, this model is lighter and has a shorter inference time.
## Predicted Entities
`forbearance-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_forbearance_agreement_bert_en_1.0.0_3.0_1675359983427.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_forbearance_agreement_bert_en_1.0.0_3.0_1675359983427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
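The usage example is missing from this card; the sketch below follows the pattern of other Legal NLP document-classifier cards. The embeddings stage (`sent_bert_base_cased`) and the `nlp`/`legal` module aliases from the `johnsnowlabs` library are assumptions, not taken from this page.

```python
# Assumed imports: from johnsnowlabs import nlp, legal
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_forbearance_agreement_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
model = pipeline.fit(df)
result = model.transform(df)
```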
## Results
```bash
+-----------------------+
|result                 |
+-----------------------+
|[forbearance-agreement]|
|[other]                |
|[other]                |
|[forbearance-agreement]|
+-----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_forbearance_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house + SEC documents
## Benchmarking
```bash
label precision recall f1-score support
forbearance-agreement 0.97 1.00 0.99 37
other 1.00 0.99 0.99 73
accuracy - - 0.99 110
macro-avg 0.99 0.99 0.99 110
weighted-avg 0.99 0.99 0.99 110
```
---
layout: model
title: English RobertaForQuestionAnswering (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_fpdm_hier_roberta_FT_newsqa
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_hier_roberta_FT_newsqa` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_hier_roberta_FT_newsqa_en_4.0.0_3.0_1655728663969.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_hier_roberta_FT_newsqa_en_4.0.0_3.0_1655728663969.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_hier_roberta_FT_newsqa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_fpdm_hier_roberta_FT_newsqa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.news.roberta.qa_fpdm_hier_roberta_ft_newsqa.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_fpdm_hier_roberta_FT_newsqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|458.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/fpdm_hier_roberta_FT_newsqa
---
layout: model
title: Hindi BertForQuestionAnswering Cased model (from roshnir)
author: John Snow Labs
name: bert_qa_mbert_finetuned_mlqa_dev
date: 2022-07-07
tags: [hi, open_source, bert, question_answering]
task: Question Answering
language: hi
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mBert-finetuned-mlqa-dev-hi` is a Hindi model originally trained by `roshnir`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_dev_hi_4.0.0_3.0_1657190202881.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_dev_hi_4.0.0_3.0_1657190202881.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_dev","hi") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["मेरा नाम क्या है?", "मेरा नाम क्लारा है और मैं बर्कले में रहता हूं।"]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_dev","hi")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("मेरा नाम क्या है?", "मेरा नाम क्लारा है और मैं बर्कले में रहता हूं।")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_mbert_finetuned_mlqa_dev|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|hi|
|Size:|626.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/roshnir/mBert-finetuned-mlqa-dev-hi
---
layout: model
title: Fast Neural Machine Translation Model from English to Philippine Languages
author: John Snow Labs
name: opus_mt_en_phi
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, phi, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `en`
- target languages: `phi`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_phi_xx_2.7.0_2.4_1609169890617.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_phi_xx_2.7.0_2.4_1609169890617.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_en_phi", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_phi", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.phi').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_phi|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Danish Legal Roberta Embeddings
author: John Snow Labs
name: roberta_base_danish_legal
date: 2023-02-16
tags: [da, danish, embeddings, transformer, open_source, legal, tensorflow]
task: Embeddings
language: da
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-danish-roberta-base` is a Danish model originally trained by `joelito`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_danish_legal_da_4.2.4_3.0_1676576630307.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_danish_legal_da_4.2.4_3.0_1676576630307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
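No usage example is present on this card; below is a minimal sketch in the style of other RoBertaEmbeddings cards (the sample sentence and column names are illustrative):

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = RoBertaEmbeddings.pretrained("roberta_base_danish_legal", "da") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["Jeg elsker Spark NLP."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```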
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_base_danish_legal|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|da|
|Size:|416.0 MB|
|Case sensitive:|true|
## References
https://huggingface.co/joelito/legal-danish-roberta-base
---
layout: model
title: BERT Sequence Classification - Classify into News Categories
author: John Snow Labs
name: bert_sequence_classifier_age_news
date: 2021-11-07
tags: [news, classification, bert_for_sequence_classification, en, open_source, ag_news]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 3.3.2
spark_version: 2.4
supported: true
annotator: BertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is imported from Hugging Face. It is a BERT-Mini model fine-tuned on the `age_news` dataset.
## Predicted Entities
`World`, `Sports`, `Business`, `Sci/Tech`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_age_news_en_3.3.2_2.4_1636288849469.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_age_news_en_3.3.2_2.4_1636288849469.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
sequenceClassifier = BertForSequenceClassification \
.pretrained('bert_sequence_classifier_age_news', 'en') \
.setInputCols(['token', 'document']) \
.setOutputCol('class') \
.setCaseSensitive(True) \
.setMaxSentenceLength(512)
pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])
example = spark.createDataFrame([['Microsoft has taken its first step into the metaverse.']]).toDF("text")
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_age_news", "en")
.setInputCols("document", "token")
.setOutputCol("class")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))
val example = Seq("Microsoft has taken its first step into the metaverse.").toDF("text")
val result = pipeline.fit(example).transform(example)
```
## Results
```bash
['Sci/Tech']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_age_news|
|Compatibility:|Spark NLP 3.3.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[label]|
|Language:|en|
|Case sensitive:|true|
## Data Source
[https://huggingface.co/mrm8488/bert-mini-finetuned-age_news-classification](https://huggingface.co/mrm8488/bert-mini-finetuned-age_news-classification)
## Benchmarking
```bash
Test set accuracy: 0.93
```
---
layout: model
title: Extract relations between problem, treatment and test entities (ReDL)
author: John Snow Labs
name: redl_clinical_biobert
date: 2021-02-04
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 2.7.3
spark_version: 2.4
tags: [licensed, clinical, en, relation_extraction]
supported: true
annotator: RelationExtractionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Extracts relations such as `TrIP` (a given treatment has improved a medical problem) and seven other relation types between problem, treatment, and test entities.
## Predicted Entities
`PIP`, `TeCP`, `TeRP`, `TrAP`, `TrCP`, `TrIP`, `TrNAP`, `TrWP`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_clinical_biobert_en_2.7.3_2.4_1612443963755.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_clinical_biobert_en_2.7.3_2.4_1612443963755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = sparknlp.annotators.Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
words_embedder = WordEmbeddingsModel() \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"]) \
.setOutputCol("embeddings")
ner_tagger = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_converter = NerConverter() \
.setInputCols(["sentences", "tokens", "ner_tags"]) \
.setOutputCol("ner_chunks")
dependency_parser = DependencyParserModel() \
.pretrained("dependency_conllu", "en") \
.setInputCols(["sentences", "pos_tags", "tokens"]) \
.setOutputCol("dependencies")
# Set a filter on pairs of named entities which will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter() \
.setInputCols(["ner_chunks", "dependencies"])\
.setMaxSyntacticDistance(10)\
.setOutputCol("re_ner_chunks")\
.setRelationPairs(["problem-problem", "problem-test", "problem-treatment"])
# The dataset this model was trained on is annotated sentence-wise.
# The model can also be trained on document-level relations; in that case, use "document" instead of "sentences" as the input while predicting.
re_model = RelationExtractionDLModel()\
.pretrained('redl_clinical_biobert', 'en', "clinical/models") \
.setPredictionThreshold(0.5)\
.setInputCols(["re_ner_chunks", "sentences"]) \
.setOutputCol("relations")
pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model])
text ="""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .
"""
data = spark.createDataFrame([[text]]).toDF("text")
p_model = pipeline.fit(data)
result = p_model.transform(data)
```
```scala
...
val documenter = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = sparknlp.annotators.Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val pos_tagger = PerceptronModel()
.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val words_embedder = WordEmbeddingsModel()
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val ner_tagger = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_converter = NerConverter()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel()
.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
// Set a filter on pairs of named entities which will be treated as relation candidates
val re_ner_chunk_filter = RENerChunksFilter()
.setInputCols(Array("ner_chunks", "dependencies"))
.setMaxSyntacticDistance(10)
.setOutputCol("re_ner_chunks")
.setRelationPairs(Array("problem-problem", "problem-test", "problem-treatment"))
// The dataset this model was trained on is annotated sentence-wise.
// The model can also be trained on document-level relations; in that case, use "document" instead of "sentences" as the input while predicting.
val re_model = RelationExtractionDLModel()
.pretrained("redl_clinical_biobert", "en", "clinical/models")
.setPredictionThreshold(0.5)
.setInputCols(Array("re_ner_chunks", "sentences"))
.setOutputCol("relations")
val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model))
val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.clinical").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .
""")
```
## Results
```bash
| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---:|:-----------|:----------|----------------:|--------------:|:--------------------------------------|:----------|----------------:|--------------:|:-------------------------|-------------:|
| 0 | PIP | PROBLEM | 39 | 67 | gestational diabetes mellitus | PROBLEM | 157 | 160 | T2DM | 0.763447 |
| 1 | PIP | PROBLEM | 39 | 67 | gestational diabetes mellitus | PROBLEM | 289 | 295 | obesity | 0.682246 |
| 2 | PIP | PROBLEM | 117 | 153 | subsequent type two diabetes mellitus | PROBLEM | 187 | 210 | HTG-induced pancreatitis | 0.610396 |
| 3 | PIP | PROBLEM | 117 | 153 | subsequent type two diabetes mellitus | PROBLEM | 264 | 281 | an acute hepatitis | 0.726894 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|redl_clinical_biobert|
|Compatibility:|Healthcare NLP 2.7.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
## Data Source
Trained on the 2010 i2b2 relation challenge dataset.
## Benchmarking
```bash
Relation Recall Precision F1 Support
PIP 0.859 0.878 0.869 1435
TeCP 0.629 0.782 0.697 337
TeRP 0.903 0.929 0.916 2034
TrAP 0.872 0.866 0.869 1693
TrCP 0.641 0.677 0.659 340
TrIP 0.517 0.796 0.627 151
TrNAP 0.402 0.672 0.503 112
TrWP 0.257 0.824 0.392 109
Avg. 0.635 0.803 0.691
```
---
layout: model
title: Sentence Entity Resolver for ICD10-CM (general 3 character codes)
author: John Snow Labs
name: sbiobertresolve_icd10cm_generalised
date: 2021-09-29
tags: [licensed, clinical, en, entity_resolution]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.2.1
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to ICD10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It predicts ICD codes up to 3 characters (in the ICD-10 code structure, the first three characters represent the general type of the injury or disease).
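Since the first three characters of an ICD-10-CM code form its general category, generalising a full code amounts to truncating it. A small illustrative sketch, independent of Spark NLP (the example codes are chosen for illustration only):

```python
def generalise_icd10(code: str) -> str:
    # Keep only the 3-character ICD-10-CM category, dropping the dot and
    # any etiology/site/severity detail that follows, e.g. "E11.9" -> "E11".
    return code.replace(".", "")[:3]

print(generalise_icd10("E11.9"))  # E11
print(generalise_icd10("I10"))    # I10
```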
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_generalised_en_3.2.1_3.0_1632938859569.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_generalised_en_3.2.1_3.0_1632938859569.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The ```sbiobertresolve_icd10cm_generalised``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_clinical``` as the NER model, with ```PROBLEM``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
icd10_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_icd10cm_generalised","en", "clinical/models") \
.setInputCols(["ner_chunk", "sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver])
data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
val icd10_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_icd10cm_generalised","en", "clinical/models")
.setInputCols(Array("ner_chunk", "sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver))
val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.icd10cm_generalised").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""")
```
## Results
```bash
| | chunk | begin | end | entity | code | code_desc | distance | all_k_resolutions | all_k_codes |
|---:|:----------------------------|--------:|------:|:---------|:-------|:---------------------------------------------------------|-----------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------|
| 0 | hypertension | 68 | 79 | PROBLEM | I10 | hypertension | 0 | hypertension:::hypertension (high blood pressure):::h/o: hypertension:::fh: hypertension:::hypertensive heart disease:::labile hypertension:::history of hypertension (situation):::endocrine hypertension | I10:::I15:::Z86:::Z82:::I11:::R03:::Z87:::E27 |
| 1 | chronic renal insufficiency | 83 | 109 | PROBLEM | N18 | chronic renal impairment | 0.014 | chronic renal impairment:::renal insufficiency:::renal failure:::anaemia of chronic renal insufficiency:::impaired renal function disorder:::history of renal insufficiency:::prerenal renal failure:::abnormal renal function:::abnormal renal function | N18:::P96:::N19:::D63:::N28:::Z87:::N17:::N25:::R94 |
| 2 | COPD | 113 | 116 | PROBLEM | J44 | chronic obstructive lung disease (disorder) | 0.1197 | chronic obstructive lung disease (disorder):::chronic obstructive pulmonary disease leaflet given:::chronic pulmonary congestion (disorder):::chronic respiratory failure (disorder):::chronic respiratory insufficiency:::cor pulmonale (chronic):::history of - chronic lung disease (situation) | J44:::Z76:::J81:::J96:::R06:::I27:::Z87 |
| 3 | gastritis | 120 | 128 | PROBLEM | K29 | gastritis | 0 | gastritis:::bacterial gastritis:::parasitic gastritis | K29:::B96:::K93 |
| 4 | TIA | 136 | 138 | PROBLEM | S06 | cerebral concussion | 0.1662 | cerebral concussion:::transient ischemic attack (disorder):::thalamic stroke:::cerebral trauma:::stroke:::traumatic amputation:::spinal cord stroke | S06:::G45:::I63:::S09:::I64:::T14:::G95 |
| 5 | a non-ST elevation MI | 182 | 202 | PROBLEM | I21 | non-st elevation (nstemi) myocardial infarction | 0.1615 | non-st elevation (nstemi) myocardial infarction:::nonruptured cerebral artery dissection:::acute stroke, nonatherosclerotic:::nontraumatic ischemic infarction of muscle, unsp shoulder:::history of nonatherosclerotic stroke without residual deficits:::non-traumatic cerebral hemorrhage | I21:::I67:::I63:::M62:::Z86:::I61 |
| 6 | Guaiac positive stools | 208 | 229 | PROBLEM | R85 | abnormal anal pap | 0.1807 | abnormal anal pap:::straining at stool (finding):::amine test positive:::appendiceal colic:::fecal smearing:::epiploic appendagitis:::diverticulosis of intestine (finding):::appendicitis (disorder):::colostomy present (finding):::thickened anal verge (finding):::anal fissure:::amoebic enteritis:::zenkers diverticulum | R85:::R19:::Z78:::K38:::R15:::K65:::K57:::K37:::Z93:::K62:::K60:::A06:::K22 |
| 7 | mid LAD lesion | 332 | 345 | PROBLEM | I21 | stemi involving left anterior descending coronary artery | 0.1595 | stemi involving left anterior descending coronary artery:::divided left atrium:::disorder of left atrium:::double inlet left ventricle:::left os acromiale:::furuncle of left upper limb:::left anterior fascicular hemiblock (heart rhythm):::aberrant origin of left subclavian artery:::stent in circumflex branch of left coronary artery (finding) | I21:::Q24:::I51:::Q20:::M89:::L02:::I44:::Q27:::Z95 |
| 8 | hypotension | 362 | 372 | PROBLEM | I95 | hypotension | 0 | hypotension:::supine hypotensive syndrome | I95:::O26 |
| 9 | bradycardia | 378 | 388 | PROBLEM | R00 | bradycardia | 0 | bradycardia:::bradycardia (finding):::drug-induced bradycardia:::bradycardia (disorder) | R00:::P29:::T50:::P20 |
| 10 | vagal reaction | 466 | 479 | PROBLEM | G52 | vagus nerve finding | 0.0926 | vagus nerve finding:::vasomotor reaction:::vesicular breathing (finding):::abdominal muscle tone - finding:::agonizing state:::paresthesia (finding):::glossolalia (finding):::tactile alteration (finding) | G52:::I73:::R09:::R19:::R45:::R20:::R41:::R44 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_icd10cm_generalised|
|Compatibility:|Healthcare NLP 3.2.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_chunk_embeddings]|
|Output Labels:|[icd10cm_code]|
|Language:|en|
|Case sensitive:|false|
## Data Source
Trained on the ICD-10 Clinical Modification dataset with `sbiobert_base_cased_mli` sentence embeddings. https://www.icd10data.com/ICD10CM/Codes/
---
layout: model
title: Pretrained Pipeline for Few-NERD-General NER Model
author: John Snow Labs
name: nerdl_fewnerd_100d_pipeline
date: 2021-12-03
tags: [fewnerd, nerdl, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.3.1
spark_version: 2.4
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the Few-NERD model and detects the following entities:
`PERSON`, `ORGANIZATION`, `LOCATION`, `ART`, `BUILDING`, `PRODUCT`, `EVENT`, `OTHER`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_FEW_NERD/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_FewNERD.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_100d_pipeline_en_3.3.1_2.4_1638523061152.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_100d_pipeline_en_3.3.1_2.4_1638523061152.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
fewnerd_pipeline = PretrainedPipeline("nerdl_fewnerd_100d_pipeline", lang = "en")
fewnerd_pipeline.annotate("""The Double Down is a sandwich offered by Kentucky Fried Chicken restaurants. He did not see active service again until 1882, when he took part in the Anglo-Egyptian War, and was present at the battle of Tell El Kebir (September 1882), for which he was mentioned in dispatches, received the Egypt Medal with clasp and the 3rd class of the Order of Medjidie, and was appointed a Companion of the Order of the Bath (CB).""")
```
```scala
val pipeline = new PretrainedPipeline("nerdl_fewnerd_100d_pipeline", lang = "en")
val result = pipeline.fullAnnotate("The Double Down is a sandwich offered by Kentucky Fried Chicken restaurants. He did not see active service again until 1882, when he took part in the Anglo-Egyptian War, and was present at the battle of Tell El Kebir (September 1882), for which he was mentioned in dispatches, received the Egypt Medal with clasp and the 3rd class of the Order of Medjidie, and was appointed a Companion of the Order of the Bath (CB).")(0)
```
## Results
```bash
+-----------------------+------------+
|chunk |ner_label |
+-----------------------+------------+
|Kentucky Fried Chicken |ORGANIZATION|
|Anglo-Egyptian War |EVENT |
|battle of Tell El Kebir|EVENT |
|Egypt Medal |OTHER |
|Order of Medjidie |OTHER |
+-----------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|nerdl_fewnerd_100d_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.3.1+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- NerDLModel
- NerConverter
- Finisher
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_cantonese TFWav2Vec2ForCTC from ivanlau
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_cantonese
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_cantonese` is an English model originally trained by ivanlau.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_cantonese_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_cantonese_en_4.2.0_3.0_1664113131550.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_cantonese_en_4.2.0_3.0_1664113131550.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_cantonese', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_cantonese", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_cantonese|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Stopwords Remover for Hungarian language (219 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, hu, open_source]
task: Stop Words Removal
language: hu
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_hu_3.4.1_3.0_1646673036995.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_hu_3.4.1_3.0_1646673036995.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","hu") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Nem vagy jobb, mint én"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","hu")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("Nem vagy jobb, mint én").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("hu.stopwords").predict("""Nem vagy jobb, mint én""")
```
## Results
```bash
+---------+
|result |
+---------+
|[jobb, ,]|
+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|hu|
|Size:|2.1 KB|
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from jteng)
author: John Snow Labs
name: distilbert_qa_finetuned_syllabus
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-syllabus` is an English model originally trained by `jteng`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_syllabus_en_4.3.0_3.0_1672765843797.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_syllabus_en_4.3.0_3.0_1672765843797.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned_syllabus","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned_syllabus","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_finetuned_syllabus|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/jteng/bert-finetuned-syllabus
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from ardallie)
author: John Snow Labs
name: xlmroberta_ner_ardallie_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `ardallie`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_ardallie_base_finetuned_panx_de_4.1.0_3.0_1660430915952.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_ardallie_base_finetuned_panx_de_4.1.0_3.0_1660430915952.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_ardallie_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_ardallie_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_ardallie_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/ardallie/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: English T5ForConditionalGeneration Cased model (from pitehu)
author: John Snow Labs
name: t5_ner_conll_entityreplace
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `T5_NER_CONLL_ENTITYREPLACE` is an English model originally trained by `pitehu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_ner_conll_entityreplace_en_4.3.0_3.0_1675099568513.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_ner_conll_entityreplace_en_4.3.0_3.0_1675099568513.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_ner_conll_entityreplace","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_ner_conll_entityreplace","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_ner_conll_entityreplace|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|275.5 MB|
## References
- https://huggingface.co/pitehu/T5_NER_CONLL_ENTITYREPLACE
- https://arxiv.org/pdf/2111.10952.pdf
- https://arxiv.org/pdf/1810.04805.pdf
---
layout: model
title: Legal Dispute resolution Clause Binary Classifier
author: John Snow Labs
name: legclf_dispute_resolution_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `dispute-resolution` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
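The splitting strategies above can be sketched in plain Python: paragraph splitting by multiline breaks, plus a rough guard against the 512-token embedding limit. Note the whitespace tokenisation here is an assumption for illustration; the real model uses a subword tokenizer, so treat this as a sketch, not the production splitting logic:

```python
import re

MAX_TOKENS = 512  # embedding limit mentioned above

def split_paragraphs(text):
    # Paragraph splitting by multiline: break on one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def fits_limit(paragraph, limit=MAX_TOKENS):
    # Rough token count using whitespace splitting (a simplification).
    return len(paragraph.split()) <= limit

doc = ("Section 12. Dispute Resolution. Any dispute shall be settled by arbitration.\n\n"
       "Section 13. Governing Law. This Agreement is governed by the laws of Delaware.")
chunks = [p for p in split_paragraphs(doc) if fits_limit(p)]
print(len(chunks))  # 2
```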
## Predicted Entities
`other`, `dispute-resolution`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_dispute_resolution_clause_en_1.0.0_3.2_1660122359464.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_dispute_resolution_clause_en_1.0.0_3.2_1660122359464.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+--------------------+
|              result|
+--------------------+
|[dispute-resolution]|
|             [other]|
|             [other]|
|[dispute-resolution]|
+--------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_dispute_resolution_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
dispute-resolution 0.84 0.84 0.84 32
other 0.94 0.94 0.94 84
accuracy - - 0.91 116
macro-avg 0.89 0.89 0.89 116
weighted-avg 0.91 0.91 0.91 116
```
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from allenai)
author: John Snow Labs
name: t5_small_next_word_generator_qoogle
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-next-word-generator-qoogle` is an English model originally trained by `allenai`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_next_word_generator_qoogle_en_4.3.0_3.0_1675126551905.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_next_word_generator_qoogle_en_4.3.0_3.0_1675126551905.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_small_next_word_generator_qoogle","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_small_next_word_generator_qoogle","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_small_next_word_generator_qoogle|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|148.1 MB|
## References
- https://huggingface.co/allenai/t5-small-next-word-generator-qoogle
---
layout: model
title: Translate English to Tagalog Pipeline
author: John Snow Labs
name: translate_en_tl
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, tl, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `tl`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_tl_xx_2.7.0_2.4_1609691452242.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_tl_xx_2.7.0_2.4_1609691452242.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_tl", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_tl", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.tl').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_tl|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011)
author: John Snow Labs
name: distilbert_token_classifier_autotrain_name_vsv_all_901529445
date: 2023-03-03
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-name_vsv_all-901529445` is an English model originally trained by `ismail-lucifer011`.
## Predicted Entities
`OOV`, `Name`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_vsv_all_901529445_en_4.3.1_3.0_1677881751372.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_vsv_all_901529445_en_4.3.1_3.0_1677881751372.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_vsv_all_901529445","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_vsv_all_901529445","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_autotrain_name_vsv_all_901529445|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ismail-lucifer011/autotrain-name_vsv_all-901529445
---
layout: model
title: BioBERT Sentence Embeddings (PMC)
author: John Snow Labs
name: sent_biobert_pmc_base_cased
date: 2020-09-19
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.2
spark_version: 2.4
tags: [embeddings, en, open_source]
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, and question answering. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)".
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_biobert_pmc_base_cased_en_2.6.2_2.4_1600532770743.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_biobert_pmc_base_cased_en_2.6.2_2.4_1600532770743.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pmc_base_cased", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pmc_base_cased", "en")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.biobert.pmc_base_cased').predict(text, output_level='sentence')
embeddings_df
```
{:.h2_title}
## Results
```bash
sentence en_embed_sentence_biobert_pmc_base_cased_embeddings
I hate cancer [0.34035101532936096, 0.04413360357284546, -0....
Antibiotics aren't painkiller [0.4397204518318176, 0.066007100045681, -0.114...
```
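A common downstream use of these sentence vectors is similarity scoring. The sketch below is a minimal illustration, using made-up 4-dimensional stand-ins for the 768-dimensional vectors shown in the Results table; real vectors come from the `sentence_embeddings` column of the result DataFrame.

```python
import math

# Hypothetical 4-dimensional stand-ins for the 768-dimensional BioBERT
# sentence vectors shown above.
vec_a = [0.3404, 0.0441, -0.1210, 0.2200]
vec_b = [0.4397, 0.0660, -0.1140, 0.1985]

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

similarity = cosine_similarity(vec_a, vec_b)
print(round(similarity, 3))
```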
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_biobert_pmc_base_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|en|
|Dimension:|768|
|Case sensitive:|true|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert)
---
layout: model
title: BERT Embeddings trained on MEDLINE/PubMed and fine-tuned on SQuAD 2.0
author: John Snow Labs
name: bert_pubmed_squad2
date: 2021-08-30
tags: [en, open_source, squad_2_dataset, medline_pubmed_dataset, bert_embeddings]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.2.0
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/pubmed/1 and fine-tuned on SQuAD 2.0.
It uses the BERT base architecture, but some changes have been made to the original training and export scheme based on more recent findings.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pubmed_squad2_en_3.2.0_3.0_1630323544592.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pubmed_squad2_en_3.2.0_3.0_1630323544592.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
embeddings = BertEmbeddings.pretrained("bert_pubmed_squad2", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
```
```scala
val embeddings = BertEmbeddings.pretrained("bert_pubmed_squad2", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.bert.pubmed_squad2').predict(text, output_level='token')
embeddings_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_pubmed_squad2|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Case sensitive:|false|
## Data Source
[1]: [Wikipedia dataset](https://dumps.wikimedia.org/)
[2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb)
[3]: [Stanford Queston Answering (SQuAD 2.0) dataset](https://rajpurkar.github.io/SQuAD-explorer/)
[4]: [MEDLINE/PubMed dataset](https://www.nlm.nih.gov/databases/download/pubmed_medline.html)
This model has been imported from https://tfhub.dev/google/experts/bert/pubmed/squad2/2
---
layout: model
title: Bashkir asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt TFWav2Vec2ForCTC from AigizK
author: John Snow Labs
name: asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt
date: 2022-09-24
tags: [wav2vec2, ba, audio, open_source, asr]
task: Automatic Speech Recognition
language: ba
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt` is a Bashkir model originally trained by AigizK.
NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt_ba_4.2.0_3.0_1664040309179.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt_ba_4.2.0_3.0_1664040309179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt", "ba")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt", "ba")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
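The snippets above fit on an `audioDf` that is assumed to already exist; it should contain one row per recording with the raw audio samples as an array of floats (Wav2Vec2 models typically expect 16 kHz mono audio). Below is a minimal, hedged sketch of producing that column from a 16-bit PCM WAV file using only the Python standard library; the file name and the `audioDf` construction are illustrative.

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV file and return its samples
    normalized to floats in [-1.0, 1.0] (the format AudioAssembler
    reads from the "audio_content" column)."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical construction of the `audioDf` used above:
# floats = wav_to_floats("speech_16khz.wav")
# audioDf = spark.createDataFrame([[floats]], ["audio_content"])
```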
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|ba|
|Size:|1.2 GB|
---
layout: model
title: Hindi BertForMaskedLM Base Cased model (from Geotrend)
author: John Snow Labs
name: bert_embeddings_base_hi_cased
date: 2022-12-02
tags: [hi, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: hi
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-hi-cased` is a Hindi model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_hi_cased_hi_4.2.4_3.0_1670017763072.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_hi_cased_hi_4.2.4_3.0_1670017763072.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_hi_cased","hi") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_hi_cased","hi")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_hi_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|hi|
|Size:|339.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/bert-base-hi-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: German BertForMaskedLM Base Cased model (from Geotrend)
author: John Snow Labs
name: bert_embeddings_base_de_cased
date: 2022-12-02
tags: [de, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: de
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-de-cased` is a German model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_de_cased_de_4.2.4_3.0_1670016505471.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_de_cased_de_4.2.4_3.0_1670016505471.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_de_cased","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_de_cased","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_de_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|de|
|Size:|398.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/bert-base-de-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Translate English to Twi Pipeline
author: John Snow Labs
name: translate_en_tw
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, tw, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `tw`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_tw_xx_2.7.0_2.4_1609691518744.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_tw_xx_2.7.0_2.4_1609691518744.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_tw", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_tw", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.tw').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_tw|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Marscen)
author: John Snow Labs
name: distilbert_qa_marscen_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Marscen`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_marscen_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768784222.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_marscen_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768784222.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
question_answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_marscen_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[document_assembler, question_answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val questionAnswering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_marscen_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, questionAnswering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_marscen_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Marscen/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English BertForQuestionAnswering Cased model (from spasis)
author: John Snow Labs
name: bert_qa_spasis_finetuned_squad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `spasis`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spasis_finetuned_squad_en_4.0.0_3.0_1657186715488.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spasis_finetuned_squad_en_4.0.0_3.0_1657186715488.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spasis_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spasis_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_spasis_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/spasis/bert-finetuned-squad
---
layout: model
title: Extract Test Entities from Voice of the Patient Documents (embeddings_clinical_medium)
author: John Snow Labs
name: ner_vop_test_emb_clinical_medium
date: 2023-06-06
tags: [licensed, clinical, ner, en, vop, test]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts mentions of medical tests from documents written in the patient's own words.
## Predicted Entities
`VitalTest`, `Test`, `Measurements`, `TestResult`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_emb_clinical_medium_en_4.4.3_3.0_1686076924102.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_emb_clinical_medium_en_4.4.3_3.0_1686076924102.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_vop_test_emb_clinical_medium", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["I went to the endocrinology department to get my thyroid levels checked. They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_vop_test_emb_clinical_medium", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("I went to the endocrinology department to get my thyroid levels checked. They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| chunk | ner_label |
|:---------------|:------------|
| thyroid levels | Test |
| blood test | Test |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_test_emb_clinical_medium|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.8 MB|
|Dependencies:|embeddings_clinical_medium|
## References
In-house annotated health-related text in colloquial language.
## Sample text from the training dataset
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
## Benchmarking
```bash
label tp fp fn total precision recall f1
VitalTest 162 32 10 172 0.84 0.94 0.89
Test 1040 118 168 1208 0.90 0.86 0.88
Measurements 136 22 50 186 0.86 0.73 0.79
TestResult 360 109 164 524 0.77 0.69 0.73
macro_avg 1698 281 392 2090 0.84 0.80 0.82
micro_avg 1698 281 392 2090 0.86 0.81 0.84
```
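The precision, recall, and F1 columns above follow directly from the tp/fp/fn counts in the same row; a quick plain-Python check, using the `Test` row's values from the table:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive / false-positive / false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# "Test" row from the benchmark table: tp=1040, fp=118, fn=168
p, r, f = prf(1040, 118, 168)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.9 0.86 0.88
```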
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from ydshieh)
author: John Snow Labs
name: roberta_qa_ydshieh_base_squad2
date: 2022-12-02
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2` is an English model originally trained by `ydshieh`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ydshieh_base_squad2_en_4.2.4_3.0_1669986831741.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ydshieh_base_squad2_en_4.2.4_3.0_1669986831741.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
question_answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ydshieh_base_squad2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[document_assembler, question_answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val questionAnswering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ydshieh_base_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, questionAnswering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_ydshieh_base_squad2|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/ydshieh/roberta-base-squad2
- https://github.com/deepset-ai/FARM/issues/552
- https://github.com/deepset-ai/FARM/blob/master/examples/question_answering.py
- https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
- https://github.com/deepset-ai/haystack/
- https://deepset.ai/german-bert
- https://deepset.ai/germanquad
- https://github.com/deepset-ai/FARM
- https://twitter.com/deepset_ai
- https://www.linkedin.com/company/deepset-ai/
- https://haystack.deepset.ai/community/join
- https://github.com/deepset-ai/haystack/discussions
- https://deepset.ai
- http://www.deepset.ai/jobs
---
layout: model
title: Pipeline to Extract the Names of Drugs & Chemicals
author: John Snow Labs
name: ner_chemd_clinical_pipeline
date: 2023-03-14
tags: [chemdner, chemd, ner, clinical, en, licensed]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_chemd_clinical](https://nlp.johnsnowlabs.com/2021/11/04/ner_chemd_clinical_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemd_clinical_pipeline_en_4.3.0_3.2_1678778578175.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemd_clinical_pipeline_en_4.3.0_3.2_1678778578175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_chemd_clinical_pipeline", "en", "clinical/models")
text = '''Isolation, Structure Elucidation, and Iron-Binding Properties of Lystabactins, Siderophores Isolated from a Marine Pseudoalteromonas sp. The marine bacterium Pseudoalteromonas sp. S2B, isolated from the Gulf of Mexico after the Deepwater Horizon oil spill, was found to produce lystabactins A, B, and C (1-3), three new siderophores. The structures were elucidated through mass spectrometry, amino acid analysis, and NMR. The lystabactins are composed of serine (Ser), asparagine (Asn), two formylated/hydroxylated ornithines (FOHOrn), dihydroxy benzoic acid (Dhb), and a very unusual nonproteinogenic amino acid, 4,8-diamino-3-hydroxyoctanoic acid (LySta). The iron-binding properties of the compounds were investigated through a spectrophotometric competition.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_chemd_clinical_pipeline", "en", "clinical/models")
val text = "Isolation, Structure Elucidation, and Iron-Binding Properties of Lystabactins, Siderophores Isolated from a Marine Pseudoalteromonas sp. The marine bacterium Pseudoalteromonas sp. S2B, isolated from the Gulf of Mexico after the Deepwater Horizon oil spill, was found to produce lystabactins A, B, and C (1-3), three new siderophores. The structures were elucidated through mass spectrometry, amino acid analysis, and NMR. The lystabactins are composed of serine (Ser), asparagine (Asn), two formylated/hydroxylated ornithines (FOHOrn), dihydroxy benzoic acid (Dhb), and a very unusual nonproteinogenic amino acid, 4,8-diamino-3-hydroxyoctanoic acid (LySta). The iron-binding properties of the compounds were investigated through a spectrophotometric competition."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:-----------------------------------|--------:|------:|:-------------|-------------:|
| 0 | Lystabactins | 65 | 76 | FAMILY | 0.9841 |
| 1 | lystabactins A, B, and C | 278 | 301 | MULTIPLE | 0.813429 |
| 2 | amino acid | 392 | 401 | FAMILY | 0.74585 |
| 3 | lystabactins | 426 | 437 | FAMILY | 0.8007 |
| 4 | serine | 455 | 460 | TRIVIAL | 0.9924 |
| 5 | Ser | 463 | 465 | FORMULA | 0.9999 |
| 6 | asparagine | 469 | 478 | TRIVIAL | 0.9795 |
| 7 | Asn | 481 | 483 | FORMULA | 0.9999 |
| 8 | formylated/hydroxylated ornithines | 491 | 524 | FAMILY | 0.50085 |
| 9 | FOHOrn | 527 | 532 | FORMULA | 0.509 |
| 10 | dihydroxy benzoic acid | 536 | 557 | SYSTEMATIC | 0.6346 |
| 11 | amino acid | 602 | 611 | FAMILY | 0.4204 |
| 12 | 4,8-diamino-3-hydroxyoctanoic acid | 614 | 647 | SYSTEMATIC | 0.9124 |
| 13 | LySta | 650 | 654 | ABBREVIATION | 0.9193 |
```
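The `begin` and `end` columns above are zero-based character offsets (inclusive) into the input text. A quick sanity check in plain Python for the first chunk of the example text:

```python
# Opening of the example text used in the pipeline above
text = ("Isolation, Structure Elucidation, and Iron-Binding Properties of "
        "Lystabactins, Siderophores Isolated from a Marine Pseudoalteromonas sp.")
chunk = "Lystabactins"
begin = text.find(chunk)
end = begin + len(chunk) - 1  # inclusive end offset, as reported in the table
print(begin, end)  # 65 76
```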
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_chemd_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Stop Words Cleaner for Yoruba
author: John Snow Labs
name: stopwords_yo
date: 2020-07-14 19:03:00 +0800
task: Stop Words Removal
language: yo
edition: Spark NLP 2.5.4
spark_version: 2.4
tags: [stopwords, yo]
supported: true
annotator: StopWordsCleaner
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
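Conceptually, stop-word removal reduces to a set-membership filter over tokens. A plain-Python sketch of the idea (the tiny stop-word list below is illustrative only; the actual Yoruba list ships with the model):

```python
# Illustrative stop-word list -- NOT the model's real Yoruba list.
stop_words = {"si", "ati", "ninu", "kan"}

def clean_tokens(tokens):
    """Keep only tokens whose lowercase form is not in the stop-word set."""
    return [t for t in tokens if t.lower() not in stop_words]

print(clean_tokens(["Yato", "si", "jijẹ", "ọba", "ariwa"]))
# ['Yato', 'jijẹ', 'ọba', 'ariwa']
```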
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_yo_yo_2.5.4_2.4_1594742440695.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_yo_yo_2.5.4_2.4_1594742440695.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
stop_words = StopWordsCleaner.pretrained("stopwords_yo", "yo") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Yato si jijẹ ọba ariwa, John Snow jẹ oṣoogun ara Gẹẹsi kan ati adari ninu idagbasoke anaesthesia ati imototo ilera.")
```
```scala
...
val stopWords = StopWordsCleaner.pretrained("stopwords_yo", "yo")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords))
val data = Seq("Yato si jijẹ ọba ariwa, John Snow jẹ oṣoogun ara Gẹẹsi kan ati adari ninu idagbasoke anaesthesia ati imototo ilera.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Yato si jijẹ ọba ariwa, John Snow jẹ oṣoogun ara Gẹẹsi kan ati adari ninu idagbasoke anaesthesia ati imototo ilera."""]
stopword_df = nlu.load('yo.stopwords').predict(text)
stopword_df[['cleanTokens']]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=3, result='Yato', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=5, end=6, result='si', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=8, end=11, result='jijẹ', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=13, end=15, result='ọba', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=17, end=21, result='ariwa', metadata={'sentence': '0'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_yo|
|Type:|stopwords|
|Compatibility:|Spark NLP 2.5.4+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|yo|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)
---
layout: model
title: English BertForMaskedLM Base Uncased model (from mlcorelib)
author: John Snow Labs
name: bert_embeddings_deberta_base_uncased
date: 2022-12-06
tags: [en, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deberta-base-uncased` is an English model originally trained by `mlcorelib`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_deberta_base_uncased_en_4.2.4_3.0_1670326237283.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_deberta_base_uncased_en_4.2.4_3.0_1670326237283.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_deberta_base_uncased","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bertLoaded = BertEmbeddings.pretrained("bert_embeddings_deberta_base_uncased","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bertLoaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_deberta_base_uncased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|409.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/mlcorelib/deberta-base-uncased
- https://arxiv.org/abs/1810.04805
- https://github.com/google-research/bert
- https://yknzhu.wixsite.com/mbweb
- https://en.wikipedia.org/wiki/English_Wikipedia
---
layout: model
title: English RobertaForQuestionAnswering (from huxxx657)
author: John Snow Labs
name: roberta_qa_roberta_base_finetuned_squad_2
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad-2` is an English model originally trained by `huxxx657`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_squad_2_en_4.0.0_3.0_1655734456241.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_squad_2_en_4.0.0_3.0_1655734456241.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_squad_2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_finetuned_squad_2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_v2.by_huxxx657").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_finetuned_squad_2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|437.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/huxxx657/roberta-base-finetuned-squad-2
---
layout: model
title: Finnish asr_wav2vec2_large_xlsr_finnish TFWav2Vec2ForCTC from birgermoell
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_finnish
date: 2022-09-24
tags: [wav2vec2, fi, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_finnish` is a Finnish model originally trained by birgermoell.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_finnish_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_finnish_fi_4.2.0_3.0_1664021434692.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_finnish_fi_4.2.0_3.0_1664021434692.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_finnish', lang = 'fi')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_finnish", lang = "fi")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_finnish|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|fi|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_8_h_768
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-8_H-768` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_768_zh_4.2.4_3.0_1670326074741.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_768_zh_4.2.4_3.0_1670326074741.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_768","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bertLoaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_768","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bertLoaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_8_h_768|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|277.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-8_H-768
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://cloud.tencent.com/
---
layout: model
title: Detect Clinical Events (Admissions)
author: John Snow Labs
name: ner_events_admission_clinical
date: 2021-03-01
tags: [ner, licensed, clinical, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 2.7.4
spark_version: 2.4
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model can be used to detect clinical events in medical text, with a focus on admission entities.
## Predicted Entities
`DATE`, `TIME`, `PROBLEM`, `TEST`, `TREATMENT`, `OCCURRENCE`, `CLINICAL_DEPT`, `EVIDENTIAL`, `DURATION`, `FREQUENCY`, `ADMISSION`, `DISCHARGE`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_EVENTS_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_admission_clinical_en_2.7.4_2.4_1614582648104.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_admission_clinical_en_2.7.4_2.4_1614582648104.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter at the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_events_admission_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["The patient presented to the emergency room last evening"]], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = NerDLModel.pretrained("ner_events_admission_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))
val data = Seq("""The patient presented to the emergency room last evening""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.admission_events").predict("""The patient presented to the emergency room last evening""")
```
## Results
```bash
+----+-----------------------------+---------+---------+-----------------+
| | chunk | begin | end | entity |
+====+=============================+=========+=========+=================+
| 0 | presented | 12 | 20 | EVIDENTIAL |
+----+-----------------------------+---------+---------+-----------------+
| 1 | the emergency room | 25 | 42 | CLINICAL_DEPT |
+----+-----------------------------+---------+---------+-----------------+
| 2 | last evening | 44 | 55 | DATE |
+----+-----------------------------+---------+---------+-----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_events_admission_clinical|
|Type:|ner|
|Compatibility:|Healthcare NLP 2.7.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Data Source
Trained on augmented/enriched i2b2 events data with clinical_embeddings. The data for Admissions has been enriched specifically.
## Benchmarking
```bash
label tp fp fn prec rec f1
I-TIME 42 6 9 0.875 0.8235294 0.8484849
I-TREATMENT 1134 111 312 0.9108434 0.7842324 0.8428094
B-OCCURRENCE 406 344 382 0.5413333 0.51522845 0.52795845
I-DURATION 160 42 71 0.7920792 0.6926407 0.73903
B-DATE 500 32 49 0.9398496 0.9107468 0.92506933
I-DATE 309 54 49 0.8512397 0.8631285 0.8571429
B-ADMISSION 206 1 2 0.9951691 0.99038464 0.9927711
I-PROBLEM 2394 390 412 0.85991377 0.85317177 0.8565295
B-CLINICAL_DEPT 327 64 77 0.8363171 0.8094059 0.8226415
B-TIME 44 12 15 0.78571427 0.7457627 0.76521736
I-CLINICAL_DEPT 597 62 78 0.90591806 0.8844444 0.8950525
B-PROBLEM 1643 260 252 0.86337364 0.86701846 0.86519223
I-FREQUENCY 35 21 39 0.625 0.47297296 0.5384615
I-TEST 1082 171 117 0.86352754 0.9024187 0.8825449
B-TEST 781 125 127 0.8620309 0.86013216 0.86108047
B-TREATMENT 1283 176 202 0.87936944 0.8639731 0.87160325
B-DISCHARGE 155 0 1 1.0 0.99358976 0.99678457
B-EVIDENTIAL 269 25 75 0.914966 0.78197676 0.84326017
B-DURATION 97 43 44 0.69285715 0.6879433 0.6903914
B-FREQUENCY 70 16 33 0.81395346 0.6796116 0.7407407
tp: 11841 fp: 2366 fn: 2680 labels: 22
Macro-average prec: 0.8137135, rec: 0.7533389, f1: 0.7823631
Micro-average prec: 0.83346236, rec: 0.8154397, f1: 0.8243525
```
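The micro-average row can be reproduced from the aggregate tp/fp/fn totals reported just above it; a quick plain-Python reproduction:

```python
# Aggregate counts from the "tp: 11841 fp: 2366 fn: 2680" line of the benchmark
tp, fp, fn = 11841, 2366, 2680
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"micro-average prec: {precision:.7f}, rec: {recall:.7f}, f1: {f1:.7f}")
```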
---
layout: model
title: Legal Whereas Clause Binary Classifier
author: John Snow Labs
name: legclf_cuad_whereas_clause
date: 2022-11-25
tags: [whereas, en, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `whereas` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.
If you have big legal documents and want to look for clauses, we recommend splitting them using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into account that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
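The paragraph-splitting plus length-capping idea described above can be sketched in plain Python. This is a toy illustration, independent of the Spark NLP splitters; the 512 limit refers to tokens as counted by the model's tokenizer, approximated here by whitespace-separated words:

```python
def split_paragraphs(text, max_tokens=512):
    """Split on blank lines, then cap each paragraph at max_tokens whitespace words."""
    chunks = []
    for para in text.split("\n\n"):
        words = para.split()
        if not words:
            continue  # skip empty paragraphs
        for i in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[i:i + max_tokens]))
    return chunks

doc = "WHEREAS, the parties wish to cooperate.\n\nNOW, THEREFORE, the parties agree as follows."
print(split_paragraphs(doc))
# ['WHEREAS, the parties wish to cooperate.', 'NOW, THEREFORE, the parties agree as follows.']
```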
## Predicted Entities
`whereas`, `other`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_whereas_clause_en_1.0.0_3.0_1669379828062.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_whereas_clause_en_1.0.0_3.0_1669379828062.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------+
| result|
+-------+
|[whereas]|
|[other]|
|[other]|
|[whereas]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_cuad_whereas_clause|
|Type:|legal|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.4 MB|
## References
In-house annotations on CUAD dataset
## Benchmarking
```bash
label precision recall f1-score support
other 0.98 0.94 0.96 67
whereas 0.91 0.98 0.94 41
accuracy - - 0.95 108
macro-avg 0.95 0.96 0.95 108
weighted-avg 0.96 0.95 0.95 108
```
---
layout: model
title: English DistilBertForQuestionAnswering model (from T-qualizer)
author: John Snow Labs
name: distilbert_qa_base_uncased_finetuned_advers
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-advers` is an English model originally trained by `T-qualizer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_advers_en_4.0.0_3.0_1654723842338.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_advers_en_4.0.0_3.0_1654723842338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_advers","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifer = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_advers","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.base_uncased").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_finetuned_advers|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/T-qualizer/distilbert-base-uncased-finetuned-advers
---
layout: model
title: Applicable Law Clause NER Model
author: John Snow Labs
name: legner_applicable_law_clause
date: 2023-01-12
tags: [en, ner, licensed, applicable_law]
task: Named Entity Recognition
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is an NER model intended to be run on `applicable_law` clauses to retrieve entities labeled `APPLIC_LAW`. Make sure you run this model only on `applicable_law` clauses, after filtering them with the `legclf_applicable_law_cuad` model.
## Predicted Entities
`APPLIC_LAW`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_applicable_law_clause_en_1.0.0_3.0_1673558480167.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_applicable_law_clause_en_1.0.0_3.0_1673558480167.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
ner_model = legal.NerModel.pretrained("legner_applicable_law_clause", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""ELECTRAMECCANICA VEHICLES CORP., an entity incorporated under the laws of the Province of British Columbia, Canada, with an address of Suite 102 East 1st Avenue, Vancouver, British Columbia, Canada, V5T 1A4 ("EMV")""" ]
result = model.transform(spark.createDataFrame([text]).toDF("text"))
```
## Results
```bash
+----------------------------------------+----------+----------+
|chunk |ner_label |confidence|
+----------------------------------------+----------+----------+
|laws of the Province of British Columbia|APPLIC_LAW|0.95625716|
+----------------------------------------+----------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_applicable_law_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|1.1 MB|
## References
In-house dataset
## Benchmarking
```bash
label precision recall f1-score support
B-APPLIC_LAW 0.90 0.89 0.90 84
I-APPLIC_LAW 0.98 0.93 0.96 425
micro-avg 0.97 0.93 0.95 509
macro-avg 0.94 0.91 0.93 509
weighted-avg 0.97 0.93 0.95 509
```
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from laampt)
author: John Snow Labs
name: distilbert_qa_laampt_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `laampt`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_laampt_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771910439.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_laampt_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771910439.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_laampt_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_laampt_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_laampt_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/laampt/distilbert-base-uncased-finetuned-squad
---
layout: model
title: German asr_wav2vec2_base_german TFWav2Vec2ForCTC from aware-ai
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_german
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_german` is a German model originally trained by aware-ai.
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_base_german_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_german_de_4.2.0_3.0_1664099298596.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_german_de_4.2.0_3.0_1664099298596.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_german', lang = 'de')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_german", lang = "de")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_german|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|de|
|Size:|355.0 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English BertForQuestionAnswering model (from NeuML)
author: John Snow Labs
name: bert_qa_bert_small_cord19_squad2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-small-cord19-squad2` is an English model originally trained by `NeuML`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_cord19_squad2_en_4.0.0_3.0_1654184738698.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_cord19_squad2_en_4.0.0_3.0_1654184738698.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_small_cord19_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_small_cord19_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_cord19.bert.small").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_small_cord19_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|130.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/NeuML/bert-small-cord19-squad2
---
layout: model
title: Spanish Bert Embeddings (from amine)
author: John Snow Labs
name: bert_embeddings_bert_base_5lang_cased
date: 2022-04-11
tags: [bert, embeddings, es, open_source]
task: Embeddings
language: es
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-5lang-cased` is a Spanish model originally trained by `amine`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_5lang_cased_es_3.4.2_3.0_1649671304061.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_5lang_cased_es_3.4.2_3.0_1649671304061.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_5lang_cased","es") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Me encanta Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_5lang_cased","es")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Me encanta Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.embed.bert_base_5lang_cased").predict("""Me encanta Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_5lang_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|es|
|Size:|464.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/amine/bert-base-5lang-cased
- https://cloud.google.com/compute/docs/machine-types#n1_machine_type
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from saraks)
author: John Snow Labs
name: distilbert_qa_cuad_document_name_08_25
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cuad-distil-document_name-08-25` is an English model originally trained by `saraks`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_document_name_08_25_en_4.3.0_3.0_1672766062646.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_document_name_08_25_en_4.3.0_3.0_1672766062646.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_document_name_08_25","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_document_name_08_25","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_cuad_document_name_08_25|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/saraks/cuad-distil-document_name-08-25
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from srmukundb)
author: John Snow Labs
name: distilbert_qa_srmukundb_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `srmukundb`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_srmukundb_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772870213.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_srmukundb_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772870213.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_srmukundb_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_srmukundb_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_srmukundb_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/srmukundb/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English image_classifier_vit_asl ViTForImageClassification from akahana
author: John Snow Labs
name: image_classifier_vit_asl
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_asl` is an English model originally trained by akahana.
## Predicted Entities
`E`, `del`, `X`, `N`, `T`, `Y`, `J`, `U`, `F`, `A`, `M`, `I`, `G`, `nothing`, `V`, `Q`, `L`, `space`, `B`, `P`, `C`, `H`, `W`, `K`, `R`, `O`, `D`, `Z`, `S`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_asl_en_4.1.0_3.0_1660166442859.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_asl_en_4.1.0_3.0_1660166442859.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_asl", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_asl", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_asl|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|322.0 MB|
---
layout: model
title: Detect Assertion Status (assertion_dl_healthcare)
author: John Snow Labs
name: assertion_dl_healthcare
class: AssertionDLModel
reference embedding: healthcare_embeddings
language: en
nav_key: models
repository: clinical/models
date: 2020-09-23
task: Assertion Status
edition: Healthcare NLP 2.6.0
spark_version: 2.4
tags: [clinical,licensed,assertion,en]
supported: true
annotator: AssertionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Assertion of Clinical Entities based on Deep Learning.
## Predicted Entities
`hypothetical`, `present`, `absent`, `possible`, `conditional`, `associated_with_someone_else`.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_healthcare_en_2.6.0_2.4_1600849811713.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_healthcare_en_2.6.0_2.4_1600849811713.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, AssertionDLModel.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
clinical_assertion = AssertionDLModel.pretrained("assertion_dl_healthcare","en","clinical/models")\
.setInputCols(["document","ner_chunk","embeddings"])\
.setOutputCol("assertion")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion])
data = spark.createDataFrame([["Patient has a headache for the last 2 weeks and appears anxious when she walks fast. No alopecia noted. She denies pain"]]).toDF("text")
model = nlpPipeline.fit(data)
results = model.transform(data)
```
```scala
...
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val clinical_assertion = AssertionDLModel.pretrained("assertion_dl_healthcare","en","clinical/models")
.setInputCols("document","ner_chunk","embeddings")
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion))
val data = Seq("Patient has a headache for the last 2 weeks and appears anxious when she walks fast. No alopecia noted. She denies pain").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.assert.healthcare").predict("""Patient has a headache for the last 2 weeks and appears anxious when she walks fast. No alopecia noted. She denies pain""")
```
{:.h2_title}
## Result
```bash
| | chunks | entities| assertion |
|--:|-----------:|--------:|------------:|
| 0 | a headache | PROBLEM | present |
| 1 | anxious | PROBLEM | conditional |
| 2 | alopecia | PROBLEM | absent |
| 3 | pain | PROBLEM | absent |
```
{:.model-param}
## Model Information
{:.table-model}
|----------------|----------------------------------|
| Name: | assertion_dl_healthcare |
| Type: | AssertionDLModel |
| Compatibility: | 2.6.0 |
| License: | Licensed |
| Edition: | Official |
|Input labels: | [document, chunk, word_embeddings] |
|Output labels: | [assertion] |
| Language: | en |
| Case sensitive: | False |
| Dependencies: | embeddings_healthcare_100d |
{:.h2_title}
## Data Source
Trained using ``embeddings_clinical`` on the 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text, from https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
{:.h2_title}
## Benchmarking
```bash
label prec rec f1
absent 0.9289 0.9466 0.9377
present 0.9433 0.9559 0.9496
conditional 0.6888 0.5 0.5794
associated_with_someone_else 0.9285 0.9122 0.9203
hypothetical 0.9079 0.8654 0.8862
possible 0.7 0.6146 0.6545
macro-avg 0.8496 0.7991 0.8236
micro-avg 0.9245 0.9245 0.9245
```
---
layout: model
title: English asr_wav2vec2_base_100h_ngram TFWav2Vec2ForCTC from saahith
author: John Snow Labs
name: asr_wav2vec2_base_100h_ngram
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_ngram` is an English model originally trained by saahith.
NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_base_100h_ngram_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_ngram_en_4.2.0_3.0_1664042339482.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_ngram_en_4.2.0_3.0_1664042339482.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_100h_ngram", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_100h_ngram", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_100h_ngram|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|227.9 MB|
---
layout: model
title: IndicBERT - Albert for 12 major Indian languages
author: John Snow Labs
name: albert_indic
date: 2022-01-26
tags: [open_source, albert, as, bn, en, gu, kn, ml, mr, or, pa, ta, te, xx]
task: Embeddings
language: xx
edition: Spark NLP 3.4.0
spark_version: 3.0
supported: true
annotator: AlBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
IndicBERT is a multilingual ALBERT model pretrained exclusively on 12 major Indian languages. It is pre-trained on a novel monolingual corpus of around 9 billion tokens and subsequently evaluated on a set of diverse tasks. IndicBERT has far fewer parameters than other multilingual models (mBERT, XLM-R, etc.) while achieving performance on par with or better than these models.
The 12 languages covered by IndicBERT are: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_indic_xx_3.4.0_3.0_1643211494926.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_indic_xx_3.4.0_3.0_1643211494926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = AlbertEmbeddings.pretrained("albert_indic","xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True) \
.setCleanAnnotations(False)
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
])
data = spark.createDataFrame([
["கர்நாடக சட்டப் பேரவையில் வெற்றி பெற்ற எம்எல்ஏக்கள் இன்று பதவியேற்றுக் கொண்ட நிலையில் , காங்கிரஸ் எம்எல்ஏ ஆனந்த் சிங் க்கள் ஆப்சென்ட் ஆகி அதிர்ச்சியை ஏற்படுத்தியுள்ளார் . உச்சநீதிமன்ற உத்தரவுப்படி இன்று மாலை முதலமைச்சர் எடியூரப்பா இன்று நம்பிக்கை வாக்கெடுப்பு நடத்தி பெரும்பான்மையை நிரூபிக்க உச்சநீதிமன்றம் உத்தரவிட்டது ."],
]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
```
```scala
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.AlbertEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = AlbertEmbeddings.pretrained("albert_indic", "xx")
.setInputCols("token", "document")
.setOutputCol("embeddings")
val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)
.setCleanAnnotations(false)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
))
val data = Seq("கர்நாடக சட்டப் பேரவையில் வெற்றி பெற்ற எம்எல்ஏக்கள் இன்று பதவியேற்றுக் கொண்ட நிலையில் , காங்கிரஸ் எம்எல்ஏ ஆனந்த் சிங் க்கள் ஆப்சென்ட் ஆகி அதிர்ச்சியை ஏற்படுத்தியுள்ளார் . உச்சநீதிமன்ற உத்தரவுப்படி இன்று மாலை முதலமைச்சர் எடியூரப்பா இன்று நம்பிக்கை வாக்கெடுப்பு நடத்தி பெரும்பான்மையை நிரூபிக்க உச்சநீதிமன்றம் உத்தரவிட்டது .")
.toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
```
{:.nlu-block}
```python
import nlu
nlu.load("xx.embed.albert.indic").predict("""கர்நாடக சட்டப் பேரவையில் வெற்றி பெற்ற எம்எல்ஏக்கள் இன்று பதவியேற்றுக் கொண்ட நிலையில் , காங்கிரஸ் எம்எல்ஏ ஆனந்த் சிங் க்கள் ஆப்சென்ட் ஆகி அதிர்ச்சியை ஏற்படுத்தியுள்ளார் . உச்சநீதிமன்ற உத்தரவுப்படி இன்று மாலை முதலமைச்சர் எடியூரப்பா இன்று நம்பிக்கை வாக்கெடுப்பு நடத்தி பெரும்பான்மையை நிரூபிக்க உச்சநீதிமன்றம் உத்தரவிட்டது .""")
```
## Results
```bash
+--------------------------------------------------------------------------------+
| result|
+--------------------------------------------------------------------------------+
|[0.2693195641040802,-0.6446362733840942,-0.05138964205980301,0.06030936539173...|
|[0.027906809002161026,-0.37459731101989746,-0.08371371030807495,-0.0869174525...|
|[0.3804604113101959,-0.7870151400566101,0.08463867008686066,-0.30186718702316...|
|[0.15204764902591705,-0.26839596033096313,0.07375998795032501,-0.131638795137...|
|[0.1482795625925064,-0.221298485994339,-0.022987276315689087,-0.2132280170917...|
+--------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_indic|
|Compatibility:|Spark NLP 3.4.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|xx|
|Size:|128.3 MB|
## References
The model was exported from transformers and is based on https://github.com/AI4Bharat/indic-bert
---
layout: model
title: English RobertaForQuestionAnswering (from comacrae)
author: John Snow Labs
name: roberta_qa_roberta_paraphrasev3
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-paraphrasev3` is an English model originally trained by `comacrae`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_paraphrasev3_en_4.0.0_3.0_1655738199528.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_paraphrasev3_en_4.0.0_3.0_1655738199528.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_paraphrasev3","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_paraphrasev3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.roberta.paraphrasev3.by_comacrae").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_paraphrasev3|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/comacrae/roberta-paraphrasev3
---
layout: model
title: English image_classifier_vit_base_patch32_384_finetuned_eurosat ViTForImageClassification from keithanpai
author: John Snow Labs
name: image_classifier_vit_base_patch32_384_finetuned_eurosat
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch32_384_finetuned_eurosat` is an English model originally trained by keithanpai.
## Predicted Entities
`dff`, `bklf`, `nvf`, `vascf`, `akiecf`, `bccf`, `melf`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch32_384_finetuned_eurosat_en_4.1.0_3.0_1660172185841.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch32_384_finetuned_eurosat_en_4.1.0_3.0_1660172185841.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_base_patch32_384_finetuned_eurosat", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_base_patch32_384_finetuned_eurosat", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_base_patch32_384_finetuned_eurosat|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|328.4 MB|
---
layout: model
title: Part of Speech for Irish
author: John Snow Labs
name: pos_ud_idt
date: 2021-03-09
tags: [part_of_speech, open_source, irish, pos_ud_idt, ga]
task: Part of Speech Tagging
language: ga
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`.
## Predicted Entities
- ADP
- NOUN
- DET
- AUX
- PRON
- VERB
- SCONJ
- PART
- ADV
- PUNCT
- CCONJ
- ADJ
- PROPN
- NUM
- X
- SYM
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_idt_ga_3.0.0_3.0_1615292201208.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_idt_ga_3.0.0_3.0_1615292201208.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ud_idt", "ga") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['Dia duit ó John Labs Sneachta! ']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_idt", "ga")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("Dia duit ó John Labs Sneachta! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["Dia duit ó John Labs Sneachta! "]
token_df = nlu.load('ga.pos').predict(text)
token_df
```
## Results
```bash
token pos
0 Dia NOUN
1 duit NOUN
2 ó ADP
3 John PROPN
4 Labs PROPN
5 Sneachta NOUN
6 ! PUNCT
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_idt|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|ga|
---
layout: model
title: BioBERT Embeddings (Pubmed Large)
author: John Snow Labs
name: biobert_pubmed_large_cased
date: 2020-09-19
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.2
spark_version: 2.4
tags: [embeddings, en, open_source]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, and question answering. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)".
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_pubmed_large_cased_en_2.6.2_2.4_1600529365263.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_pubmed_large_cased_en_2.6.2_2.4_1600529365263.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("biobert_pubmed_large_cased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("biobert_pubmed_large_cased", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I hate cancer").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer"]
embeddings_df = nlu.load('en.embed.biobert.pubmed_large_cased').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_biobert_pubmed_large_cased_embeddings
I [-0.041047871112823486, 0.24242812395095825, 0...
hate [-0.6859451532363892, -0.45743268728256226, -0...
cancer [-0.12403186410665512, 0.6688604354858398, -0....
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|biobert_pubmed_large_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.2|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|en|
|Dimension:|1024|
|Case sensitive:|true|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert)
---
layout: model
title: English image_classifier_vit_pond_image_classification_1 ViTForImageClassification from SummerChiam
author: John Snow Labs
name: image_classifier_vit_pond_image_classification_1
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pond_image_classification_1` is an English model originally trained by SummerChiam.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_1_en_4.1.0_3.0_1660165744277.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_1_en_4.1.0_3.0_1660165744277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_pond_image_classification_1", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_pond_image_classification_1", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_pond_image_classification_1|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Laws Spanish Named Entity Recognition (from `hackathon-pln-es`)
author: John Snow Labs
name: roberta_ner_jurisbert_finetuning_ner
date: 2022-05-20
tags: [roberta, ner, token_classification, es, open_source]
task: Named Entity Recognition
language: es
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: RoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `jurisbert-finetuning-ner` is a Spanish model originally trained by `hackathon-pln-es`.
## Predicted Entities
`TRAT_INTL`, `LEY`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_ner_jurisbert_finetuning_ner_es_3.4.4_3.0_1653046369327.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_ner_jurisbert_finetuning_ner_es_3.4.4_3.0_1653046369327.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_jurisbert_finetuning_ner","es") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Me encanta Spark PNL"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_jurisbert_finetuning_ner","es")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Me encanta Spark PNL").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_ner_jurisbert_finetuning_ner|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|464.4 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
https://huggingface.co/hackathon-pln-es/jurisbert-finetuning-ner
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Rocketknight1)
author: John Snow Labs
name: distilbert_qa_rocketknight1_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Rocketknight1`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_rocketknight1_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769088913.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_rocketknight1_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769088913.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_rocketknight1_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_rocketknight1_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_rocketknight1_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Rocketknight1/distilbert-base-uncased-finetuned-squad
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from gbennett)
author: John Snow Labs
name: xlmroberta_ner_gbennett_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `gbennett`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_gbennett_base_finetuned_panx_de_4.1.0_3.0_1660433217069.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_gbennett_base_finetuned_panx_de_4.1.0_3.0_1660433217069.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_gbennett_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_gbennett_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_gbennett_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/gbennett/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: Translate English to West Germanic languages Pipeline
author: John Snow Labs
name: translate_en_gmw
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, gmw, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `gmw`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_gmw_xx_2.7.0_2.4_1609689808163.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_gmw_xx_2.7.0_2.4_1609689808163.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_gmw", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_gmw", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.gmw').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_gmw|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Fees and expenses Clause Binary Classifier (md)
author: John Snow Labs
name: legclf_fees_and_expenses_md
date: 2023-01-11
tags: [en, legal, classification, document, agreement, contract, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `fees-and-expenses` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration the embeddings of this model allows up to 512 tokens. If you have more than that, consider splitting in smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause model you have added.
## Predicted Entities
`other`, `fees-and-expenses`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_fees_and_expenses_md_en_1.0.0_3.0_1673460290897.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_fees_and_expenses_md_en_1.0.0_3.0_1673460290897.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
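No snippet was published for this card, so the sketch below follows the document-classification pattern used by the other Legal NLP classifier cards in this hub. It assumes the `johnsnowlabs` library with a valid Legal NLP license and an active `spark` session; the `sent_bert_base_cased` companion embeddings name is an assumption and may differ for this model.

```python
# Minimal sketch of a Legal NLP document-classification pipeline (assumes a licensed environment).
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Companion sentence embeddings; the model name here is an assumption.
# Note the output column matches this model's Input Labels: [embeddings].
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_fees_and_expenses_md", "en", "legal/models") \
    .setInputCols(["embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```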
## Results
```bash
+-------------------+
|             result|
+-------------------+
|[fees-and-expenses]|
|            [other]|
|            [other]|
|[fees-and-expenses]|
+-------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_fees_and_expenses_md|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents.
## Benchmarking
```bash
label  precision  recall  f1-score  support
miscellaneous-provisions 0.75 0.75 0.75 24
other 0.85 0.85 0.85 39
accuracy 0.81 63
macro avg 0.80 0.80 0.80 63
weighted avg 0.81 0.81 0.81 63
```
---
layout: model
title: Pipeline to Detect Drugs - Generalized Single Entity (ner_drugs_greedy)
author: John Snow Labs
name: ner_drugs_greedy_pipeline
date: 2023-03-15
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_drugs_greedy](https://nlp.johnsnowlabs.com/2021/03/31/ner_drugs_greedy_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugs_greedy_pipeline_en_4.3.0_3.2_1678877919575.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugs_greedy_pipeline_en_4.3.0_3.2_1678877919575.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_drugs_greedy_pipeline", "en", "clinical/models")
text = '''DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_drugs_greedy_pipeline", "en", "clinical/models")
val text = "DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.drugs_greedy.pipeline").predict("""DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated.""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:----------------------------------|--------:|------:|:------------|-------------:|
| 0 | hydrocortisone tablets | 48 | 69 | DRUG | 0.9923 |
| 1 | 20 mg to 240 mg of hydrocortisone | 85 | 117 | DRUG | 0.7361 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_drugs_greedy_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Legal Titles And Subtitles Clause Binary Classifier
author: John Snow Labs
name: legclf_titles_and_subtitles_clause
date: 2023-01-29
tags: [en, legal, classification, titles, subtitles, clauses, titles_and_subtitles, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `titles-and-subtitles` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at the sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`titles-and-subtitles`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_titles_and_subtitles_clause_en_1.0.0_3.0_1674993674002.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_titles_and_subtitles_clause_en_1.0.0_3.0_1674993674002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
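No snippet was published for this card, so the sketch below follows the document-classification pattern used by the other Legal NLP classifier cards in this hub. It assumes the `johnsnowlabs` library with a valid Legal NLP license and an active `spark` session; the `sent_bert_base_cased` companion embeddings name is an assumption and may differ for this model.

```python
# Minimal sketch of a Legal NLP document-classification pipeline (assumes a licensed environment).
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Companion sentence embeddings; the model name here is an assumption.
# Note the output column matches this model's Input Labels: [sentence_embeddings].
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_titles_and_subtitles_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```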
## Results
```bash
+----------------------+
|                result|
+----------------------+
|[titles-and-subtitles]|
|               [other]|
|               [other]|
|[titles-and-subtitles]|
+----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_titles_and_subtitles_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents scraped from the Internet and classified in-house.
## Benchmarking
```bash
label  precision  recall  f1-score  support
other 0.97 1.00 0.99 39
titles-and-subtitles 1.00 0.97 0.98 30
accuracy - - 0.99 69
macro-avg 0.99 0.98 0.99 69
weighted-avg 0.99 0.99 0.99 69
```
---
layout: model
title: Sentence Entity Resolver for ICD10-CM (Augmented)
author: John Snow Labs
name: sbiobertresolve_icd10cm_augmented
date: 2022-01-21
tags: [icd10cm, entity_resolution, clinical, en, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.1
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to ICD10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It has also been augmented with synonyms to make it more accurate.
## Predicted Entities
`ICD10CM Codes`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_3.3.1_3.0_1642756161477.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_3.3.1_3.0_1642756161477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(['PROBLEM'])
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
icd10_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver])
data_ner = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."]]).toDF("text")
results = nlpPipeline.fit(data_ner).transform(data_ner)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("PROBLEM"))
val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
val icd10_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver))
val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.icd10cm.augmented").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.""")
```
## Results
```bash
+-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ner_chunk| entity|icd10cm_code| resolutions| all_codes|
+-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
| gestational diabetes mellitus|PROBLEM| O2441|gestational diabetes mellitus:::postpartum gestational diabetes mel...| O2441:::O2443:::Z8632:::Z875:::O2431:::O2411:::O244:::O241:::O2481|
|subsequent type two diabetes mellitus|PROBLEM| O2411|pre-existing type 2 diabetes mellitus:::disorder associated with ty...|O2411:::E118:::E11:::E139:::E119:::E113:::E1144:::Z863:::Z8639:::E1...|
| T2DM|PROBLEM| E11|type 2 diabetes mellitus:::disorder associated with type 2 diabetes...|E11:::E118:::E119:::O2411:::E109:::E139:::E113:::E8881:::Z833:::D64...|
| HTG-induced pancreatitis|PROBLEM| K8520|alcohol-induced pancreatitis:::drug-induced acute pancreatitis:::he...|K8520:::K853:::K8590:::F102:::K852:::K859:::K8580:::K8591:::K858:::...|
| acute hepatitis|PROBLEM| K720|acute hepatitis:::acute hepatitis a:::acute infectious hepatitis:::...|K720:::B15:::B179:::B172:::Z0389:::B159:::B150:::B16:::K752:::K712:...|
| obesity|PROBLEM| E669|obesity:::abdominal obesity:::obese:::central obesity:::overweight ...|E669:::E668:::Z6841:::Q130:::E66:::E6601:::Z8639:::E349:::H3550:::Z...|
| a body mass index|PROBLEM| Z6841|finding of body mass index:::observation of body mass index:::mass ...|Z6841:::E669:::R229:::Z681:::R223:::R221:::Z68:::R222:::R220:::R418...|
| polyuria|PROBLEM| R35|polyuria:::nocturnal polyuria:::polyuric state:::polyuric state (di...|R35:::R3581:::R358:::E232:::R31:::R350:::R8299:::N401:::E723:::O048...|
| polydipsia|PROBLEM| R631|polydipsia:::psychogenic polydipsia:::primary polydipsia:::psychoge...|R631:::F6389:::E232:::F639:::O40:::G475:::M7989:::R632:::R061:::H53...|
| poor appetite|PROBLEM| R630|poor appetite:::poor feeding:::bad taste in mouth:::unpleasant tast...|R630:::P929:::R438:::R432:::E86:::R196:::F520:::Z724:::R0689:::Z768...|
| vomiting|PROBLEM| R111|vomiting:::intermittent vomiting:::vomiting symptoms:::periodic vom...| R111:::R11:::R1110:::G43A1:::P921:::P9209:::G43A:::R1113:::R110|
| a respiratory tract infection|PROBLEM| J988|respiratory tract infection:::upper respiratory tract infection:::b...|J988:::J069:::A499:::J22:::J209:::Z593:::T17:::J0410:::Z1383:::J189...|
+-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_icd10cm_augmented|
|Compatibility:|Healthcare NLP 3.3.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[icd10cm_code]|
|Language:|en|
|Size:|1.4 GB|
|Case sensitive:|false|
## Data Source
Trained on ICD10CM 2022 Codes dataset: https://www.cdc.gov/nchs/icd/icd10cm.htm
---
layout: model
title: Relation extraction between Drugs and ADE
author: John Snow Labs
name: re_ade_clinical
date: 2021-07-12
tags: [licensed, clinical, en, relation_extraction, ade]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 3.1.2
spark_version: 3.0
supported: true
annotator: RelationExtractionModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model relates drugs and the adverse reactions they cause; it predicts whether an adverse event is caused by a drug. `1`: the adverse event and drug entities are related; `0`: the adverse event and drug entities are not related.
## Predicted Entities
`0`, `1`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ADE/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/RE_ADE.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_ade_clinical_en_3.1.2_3.0_1626104637779.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_ade_clinical_en_3.1.2_3.0_1626104637779.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The table below shows the `re_ade_clinical` RE model, its labels, the optimal NER model to pair it with, and the meaningful relation pairs.
| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:---------------:|:--------------:|:----------------:|------------------------------|
| re_ade_clinical | 0 1 | ner_ade_clinical | ["ade-drug", "drug-ade"] |
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
words_embedder = WordEmbeddingsModel() \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_tagger = MedicalNerModel() \
.pretrained("ner_ade_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner_tags")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner_tags"]) \
.setOutputCol("ner_chunks")
pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"])\
.setOutputCol("pos_tags")
dependency_parser = sparknlp.annotators.DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentence", "pos_tags", "token"])\
.setOutputCol("dependencies")
re_model = RelationExtractionModel()\
.pretrained("re_ade_clinical", "en", 'clinical/models')\
.setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
.setOutputCol("relations")\
.setMaxSyntacticDistance(10)\
.setPredictionThreshold(0.1)\
.setRelationPairs(["ade-drug", "drug-ade"])\
.setRelationPairsCaseSensitive(False)
nlp_pipeline = Pipeline(stages=[documentAssembler,
sentenceDetector,
tokenizer,
words_embedder,
ner_tagger,
ner_converter,
pos_tagger,
dependency_parser,
re_model])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
text ="""Been taking Lipitor for 15 years , have experienced severe fatigue a lot. The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps. """
annotations = light_pipeline.fullAnnotate(text)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val ner_tagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_converter = new NerConverter()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
val re_model = RelationExtractionModel.pretrained("re_ade_clinical", "en", "clinical/models")
.setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies"))
.setOutputCol("relations")
.setMaxSyntacticDistance(3)
.setPredictionThreshold(0.5)
.setRelationPairs(Array("drug-ade", "ade-drug"))
val nlpPipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
words_embedder,
ner_tagger,
ner_converter,
pos_tagger,
dependency_parser,
re_model))
val data = Seq("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot. The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps. """).toDS.toDF("text")
val result = nlpPipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.adverse_drug_events.clinical").predict("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot. The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps.""")
```
## Results
```bash
| relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---------:|:--------|--------------:|------------:|:----------|:--------|--------------:|------------:|:---------------|-----------:|
| 1 | DRUG | 12 | 18 | Lipitor | ADE | 52 | 65 | severe fatigue | 1 |
| 1 | DRUG | 97 | 105 | voltarene | ADE | 144 | 156 | muscle cramps | 0.997283 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|re_ade_clinical|
|Type:|re|
|Compatibility:|Healthcare NLP 3.1.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]|
|Output Labels:|[relations]|
|Language:|en|
## Data Source
This model is trained on custom data annotated by JSL.
## Benchmarking
```bash
label precision recall f1-score support
0 0.86 0.88 0.87 1787
1 0.92 0.90 0.91 2586
micro-avg 0.89 0.89 0.89 4373
macro-avg 0.89 0.89 0.89 4373
weighted-avg 0.89 0.89 0.89 4373
```
---
layout: model
title: English asr_model_4 TFWav2Vec2ForCTC from niclas
author: John Snow Labs
name: pipeline_asr_model_4
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_model_4` is an English model originally trained by niclas.
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_model_4_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_model_4_en_4.2.0_3.0_1664098319002.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_model_4_en_4.2.0_3.0_1664098319002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_model_4', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_model_4", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_model_4|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English asr_Wav2Vec2_XLSR_Bengali_10500 TFWav2Vec2ForCTC from shoubhik
author: John Snow Labs
name: pipeline_asr_Wav2Vec2_XLSR_Bengali_10500
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Wav2Vec2_XLSR_Bengali_10500` is an English model originally trained by shoubhik.
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_Wav2Vec2_XLSR_Bengali_10500_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_Wav2Vec2_XLSR_Bengali_10500_en_4.2.0_3.0_1664105201102.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_Wav2Vec2_XLSR_Bengali_10500_en_4.2.0_3.0_1664105201102.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_Wav2Vec2_XLSR_Bengali_10500', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_Wav2Vec2_XLSR_Bengali_10500", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_Wav2Vec2_XLSR_Bengali_10500|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|3.6 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Brazilian Portuguese NER for Laws (Base)
author: John Snow Labs
name: legner_br_base
date: 2022-09-27
tags: [pt, licensed]
task: Named Entity Recognition
language: pt
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Deep Learning Portuguese Named Entity Recognition model for the legal domain, trained using Base Bert Embeddings, and is able to predict the following entities:
- ORGANIZACAO (Organizations)
- JURISPRUDENCIA (Jurisprudence)
- PESSOA (Person)
- TEMPO (Time)
- LOCAL (Location)
- LEGISLACAO (Laws)
- O (Other)
You can find different versions of this model in Models Hub:
- With a Deep Learning architecture (non-transformer) and Base Embeddings;
- With a Deep Learning architecture (non-transformer) and Large Embeddings;
- With a Transformers Architecture and Base Embeddings;
- With a Transformers Architecture and Large Embeddings;
## Predicted Entities
`PESSOA`, `ORGANIZACAO`, `LEGISLACAO`, `JURISPRUDENCIA`, `TEMPO`, `LOCAL`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_LEGAL_PT/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_br_base_pt_1.0.0_3.0_1664276774137.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_br_base_pt_1.0.0_3.0_1664276774137.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = nlp.Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
embeddings = nlp.BertEmbeddings.pretrained("bert_portuguese_base_cased", "pt")\
.setInputCols("document", "token") \
.setOutputCol("embeddings")
ner_model = legal.NerModel.pretrained('legner_br_base', 'pt', 'legal/models') \
.setInputCols(['document', 'token', 'embeddings']) \
.setOutputCol('ner')
ner_converter = nlp.NerConverter() \
.setInputCols(['document', 'token', 'ner']) \
.setOutputCol('ner_chunk')
pipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
embeddings,
ner_model,
ner_converter
])
example = spark.createDataFrame(pd.DataFrame({'text': ["""Mediante do exposto , com fundamento nos artigos 32 , i , e 33 , da lei 8.443/1992 , submetem-se os autos à consideração superior , com posterior encaminhamento ao ministério público junto ao tcu e ao gabinete do relator , propondo : a ) conhecer do recurso e , no mérito , negar-lhe provimento ; b ) comunicar ao recorrente , ao superior tribunal militar e ao tribunal regional federal da 2ª região , a fim de fornecer subsídios para os processos judiciais 2001.34.00.024796-9 e 2003.34.00.044227-3 ; e aos demais interessados a deliberação que vier a ser proferida por esta corte ” ."""]}))
result = pipeline.fit(example).transform(example)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_100h_by_vuiseng9", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_100h_by_vuiseng9", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_100h_by_vuiseng9|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|227.9 MB|
---
layout: model
title: Arabic BertForMaskedLM Base Cased model (from CAMeL-Lab)
author: John Snow Labs
name: bert_embeddings_base_arabic_camel_msa_eighth
date: 2022-12-02
tags: [ar, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: ar
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-arabic-camelbert-msa-eighth` is an Arabic model originally trained by `CAMeL-Lab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_msa_eighth_ar_4.2.4_3.0_1670016070157.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_msa_eighth_ar_4.2.4_3.0_1670016070157.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_msa_eighth","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_msa_eighth","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_arabic_camel_msa_eighth|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|409.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa-eighth
- https://arxiv.org/abs/2103.06678
- https://github.com/CAMeL-Lab/CAMeLBERT
- https://catalog.ldc.upenn.edu/LDC2011T11
- http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus
- https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian
- https://archive.org/details/arwiki-20190201
- https://oscar-corpus.com/
- https://github.com/google-research/bert
- https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297
- https://github.com/CAMeL-Lab/camel_tools
- https://github.com/CAMeL-Lab/CAMeLBERT
---
layout: model
title: Detect Cancer Genetics (BertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_ner_bionlp
date: 2021-11-03
tags: [bertfortokenclassification, ner, bionlp, en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.3.0
spark_version: 2.4
supported: true
annotator: MedicalBertForTokenClassifier
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts biological and genetic terms from cancer-related texts using a pre-trained NER model. It was trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP.
## Predicted Entities
`Amino_acid`, `Anatomical_system`, `Cancer`, `Cell`, `Cellular_component`, `Developing_anatomical_Structure`, `Gene_or_gene_product`, `Immaterial_anatomical_entity`, `Multi-tissue_structure`, `Organ`, `Organism`, `Organism_subdivision`, `Simple_chemical`, `Tissue`, `Organism_substance`, `Pathological_formation`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bionlp_en_3.3.0_2.4_1635952712612.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bionlp_en_3.3.0_2.4_1635952712612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_bionlp", "en", "clinical/models")\
.setInputCols("token", "document")\
.setOutputCol("ner")\
.setCaseSensitive(True)
ner_converter = NerConverter()\
.setInputCols(["document","token","ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter])
p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))
test_sentence = """Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay."""
result = p_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]})))
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_bionlp", "en", "clinical/models")
.setInputCols(Array("token", "document"))
.setOutputCol("ner")
.setCaseSensitive(true)
val ner_converter = new NerConverter()
.setInputCols(Array("document","token","ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter))
val data = Seq("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.bionlp").predict("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""")
```
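The `ner_converter` stage above merges the classifier's token-level IOB tags (e.g. `B-Cell`, `I-Cell`, `O`) into entity chunks. As a rough illustration of that grouping logic — a plain-Python sketch, not the actual Spark NLP `NerConverter` implementation — the merge can be written as:

```python
# Minimal sketch of IOB-to-chunk merging: group (token, tag) pairs into
# (chunk_text, label) tuples the way a NER converter would.
def iob_to_chunks(tokens, tags):
    """Merge token-level IOB tags into (chunk_text, label) tuples."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)         # continue the open entity
        else:                             # "O" closes any open entity
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:                           # flush a chunk ending at the last token
        chunks.append((" ".join(current), label))
    return chunks

print(iob_to_chunks(
    ["erythroid", "cells", "after", "infection", "of", "bone", "marrow"],
    ["B-Cell", "I-Cell", "O", "O", "O", "B-Tissue", "I-Tissue"]))
# [('erythroid cells', 'Cell'), ('bone marrow', 'Tissue')]
```

The entity labels here are drawn from the Predicted Entities list of this model; the helper name `iob_to_chunks` is hypothetical.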
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_base_uncased_finetuned","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_base_uncased_finetuned","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_squad_base_uncased_finetuned|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English DistilBertForQuestionAnswering model (from datarpit)
author: John Snow Labs
name: distilbert_qa_base_uncased_finetuned_natural_questions
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-natural-questions` is an English model originally trained by `datarpit`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_natural_questions_en_4.0.0_3.0_1654723994546.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_natural_questions_en_4.0.0_3.0_1654723994546.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_natural_questions","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_natural_questions","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.base_uncased.by_datarpit").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_finetuned_natural_questions|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/datarpit/distilbert-base-uncased-finetuned-natural-questions
---
layout: model
title: French CamemBert Embeddings (from Sebu)
author: John Snow Labs
name: camembert_embeddings_Sebu_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `Sebu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Sebu_generic_model_fr_3.4.4_3.0_1653986850148.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Sebu_generic_model_fr_3.4.4_3.0_1653986850148.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Sebu_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Sebu_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_Sebu_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Sebu/dummy-model
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Thitaree)
author: John Snow Labs
name: distilbert_qa_thitaree_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Thitaree`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_thitaree_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769425321.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_thitaree_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769425321.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_thitaree_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_thitaree_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_thitaree_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Thitaree/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Legal NER for MAPA(Multilingual Anonymisation for Public Administrations)
author: John Snow Labs
name: legner_mapa
date: 2023-04-27
tags: [pt, licensed, ner, legal, mapa]
task: Named Entity Recognition
language: pt
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union.
This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Portuguese` documents.
## Predicted Entities
`ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_pt_1.0.0_3.0_1682608680085.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_pt_1.0.0_3.0_1682608680085.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_pt_cased", "pt")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)\
.setCaseSensitive(True)
ner_model = legal.NerModel.pretrained("legner_mapa", "pt", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""Nos termos dos Decretos da Garda Síochána (6), só pode ser admitido como estagiário para integrar a força policial nacional quem tiver pelo menos 18 anos, mas menos de 35 anos de idade, no primeiro dia do mês em que tenha sido publicado pela primeira vez, num jornal nacional, o anúncio da vaga a que o recrutamento respeita."""]
result = model.transform(spark.createDataFrame([text]).toDF("text"))
```
## Results
```bash
+-----------------------+------------+
|chunk |ner_label |
+-----------------------+------------+
|Garda Síochána |ORGANISATION|
|força policial nacional|ORGANISATION|
|18 anos |AMOUNT |
|35 anos |AMOUNT |
+-----------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_mapa|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|pt|
|Size:|1.4 MB|
## References
The dataset is available [here](https://huggingface.co/datasets/joelito/mapa).
## Benchmarking
```bash
label precision recall f1-score support
ADDRESS 0.91 0.91 0.91 23
AMOUNT 1.00 0.83 0.91 6
DATE 1.00 0.95 0.97 61
ORGANISATION 0.85 0.77 0.81 30
PERSON 0.88 0.91 0.89 65
micro-avg          0.92      0.90      0.91       185
macro-avg 0.93 0.87 0.90 185
weighted-avg 0.92 0.90 0.91 185
```
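The aggregate rows of the benchmark table can be reproduced from the per-label rows: the macro average is the unweighted mean of the per-label F1 scores, while the weighted average weights each label by its support. A quick sanity check in plain Python (values copied from the table above):

```python
# Recompute the macro-avg and weighted-avg F1 from the per-label benchmark rows.
scores = {            # label: (f1, support)
    "ADDRESS":      (0.91, 23),
    "AMOUNT":       (0.91, 6),
    "DATE":         (0.97, 61),
    "ORGANISATION": (0.81, 30),
    "PERSON":       (0.89, 65),
}
total = sum(support for _, support in scores.values())                 # 185
macro_f1 = sum(f1 for f1, _ in scores.values()) / len(scores)          # plain mean
weighted_f1 = sum(f1 * s for f1, s in scores.values()) / total         # support-weighted
print(total, round(macro_f1, 2), round(weighted_f1, 2))
# 185 0.9 0.91
```

Both recomputed values agree with the table to two decimal places.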
---
layout: model
title: English DistilBertForQuestionAnswering model (from nlpunibo) Config1
author: John Snow Labs
name: distilbert_qa_base_config1
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert_base_config1` is an English model originally trained by `nlpunibo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config1_en_4.0.0_3.0_1654727786120.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config1_en_4.0.0_3.0_1654727786120.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.base_config1.by_nlpunibo").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_config1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/nlpunibo/distilbert_base_config1
---
layout: model
title: Translate English to Western Malayo-Polynesian languages Pipeline
author: John Snow Labs
name: translate_en_pqw
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, pqw, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `pqw`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_pqw_xx_2.7.0_2.4_1609688594063.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_pqw_xx_2.7.0_2.4_1609688594063.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_pqw", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_pqw", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.pqw').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_pqw|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Chunk Entity Resolver RxNorm-scdc
author: John Snow Labs
name: chunkresolve_rxnorm_in_healthcare
date: 2021-04-16
tags: [entity_resolution, clinical, licensed, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to RxNorm codes using chunk embeddings (augmented with synonyms, making its vocabulary four times richer than the previous resolver's).
## Predicted Entities
RxNorm codes
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_RXNORM/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_in_healthcare_en_3.0.0_3.0_1618605195699.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_in_healthcare_en_3.0.0_3.0_1618605195699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
resolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_in_healthcare", "en", "clinical/models") \
.setInputCols("token", "chunk_embeddings") \
.setOutputCol("entity")
pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, resolver])
data = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation."]]).toDF("text")
model = pipeline.fit(data)
results = model.transform(data)
...
```
```scala
...
val resolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_in_healthcare", "en", "clinical/models")
.setInputCols("token", "chunk_embeddings")
.setOutputCol("entity")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, resolver))
val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
| chunk| entity| target_text| code|confidence|
+---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
| metformin|TREATMENT|metFORMIN compounding powder:::Metformin Hydrochloride Powder:::metFORMIN 500 mg oral tablet:::me...| 601021| 0.2364|
| glipizide|TREATMENT|Glipizide Powder:::Glipizide Crystal:::Glipizide Tablets:::glipiZIDE 5 mg oral tablet:::glipiZIDE...| 241604| 0.3647|
|dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG|TREATMENT|Ezetimibe and Atorvastatin Tablets:::Amlodipine and Atorvastatin Tablets:::Atorvastatin Calcium T...|1422084| 0.3407|
| dapagliflozin|TREATMENT|Dapagliflozin Tablets:::dapagliflozin 5 mg oral tablet:::dapagliflozin 10 mg oral tablet:::Dapagl...|1488568| 0.7070|
+---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
```
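In the results table above, the `target_text` column packs the candidate RxNorm descriptions for each chunk into a single `:::`-separated string, ranked by similarity, with the returned `code` and `confidence` alongside. A small plain-Python sketch of post-processing one such row (the `row` dict mirrors the table; the field layout of the actual annotation metadata may differ):

```python
# Sketch: unpack the ":::"-joined candidate list from a resolver result row.
row = {
    "chunk": "dapagliflozin",
    "target_text": ("Dapagliflozin Tablets:::dapagliflozin 5 mg oral tablet"
                    ":::dapagliflozin 10 mg oral tablet"),
    "code": "1488568",
    "confidence": 0.7070,
}

candidates = row["target_text"].split(":::")   # ranked candidate descriptions
best = candidates[0]                           # top-ranked candidate
print(best, row["code"], row["confidence"])
# Dapagliflozin Tablets 1488568 0.707
```

This makes it easy to surface the runner-up candidates when the confidence is low.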
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|chunkresolve_rxnorm_in_healthcare|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[token, chunk_embeddings]|
|Output Labels:|[rxnorm_code]|
|Language:|en|
---
layout: model
title: Finnish asr_wav2vec2_xlsr_300m_finnish TFWav2Vec2ForCTC from aapot
author: John Snow Labs
name: asr_wav2vec2_xlsr_300m_finnish
date: 2022-09-24
tags: [wav2vec2, fi, audio, open_source, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_300m_finnish` is a Finnish model originally trained by aapot.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_xlsr_300m_finnish_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_300m_finnish_fi_4.2.0_3.0_1664023005420.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_300m_finnish_fi_4.2.0_3.0_1664023005420.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_xlsr_300m_finnish", "fi")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_xlsr_300m_finnish", "fi")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_xlsr_300m_finnish|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|fi|
|Size:|1.2 GB|
---
layout: model
title: English DistilBertForQuestionAnswering model (from graviraja)
author: John Snow Labs
name: distilbert_qa_graviraja_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `graviraja`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_graviraja_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725306055.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_graviraja_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725306055.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_graviraja_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_graviraja_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_graviraja").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_graviraja_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/graviraja/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Indonesian RoBERTa Embeddings (from w11wo)
author: John Snow Labs
name: roberta_embeddings_indo_roberta_small
date: 2022-04-14
tags: [roberta, embeddings, id, open_source]
task: Embeddings
language: id
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indo-roberta-small` is an Indonesian model originally trained by `w11wo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indo_roberta_small_id_3.4.2_3.0_1649948731693.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indo_roberta_small_id_3.4.2_3.0_1649948731693.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indo_roberta_small","id") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Saya suka percikan NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indo_roberta_small","id")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Saya suka percikan NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("id.embed.indo_roberta_small").predict("""Saya suka percikan NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_indo_roberta_small|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|id|
|Size:|314.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/w11wo/indo-roberta-small
- https://arxiv.org/abs/1907.11692
- https://github.com/sgugger
- https://w11wo.github.io/
---
layout: model
title: English DistilBertForQuestionAnswering Base Cased model (from Moussab)
author: John Snow Labs
name: distilbert_qa_base_cased_led_squad_orkg_which_1e_04
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-orkg-which-1e-04` is an English model originally trained by `Moussab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_which_1e_04_en_4.3.0_3.0_1672766889512.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_which_1e_04_en_4.3.0_3.0_1672766889512.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_which_1e_04","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_which_1e_04","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_cased_led_squad_orkg_which_1e_04|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Moussab/distilbert-base-cased-distilled-squad-orkg-which-1e-04
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_by_anan0329 TFWav2Vec2ForCTC from anan0329
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_anan0329
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_anan0329` is an English model originally trained by anan0329.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_anan0329_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_anan0329_en_4.2.0_3.0_1664114693280.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_anan0329_en_4.2.0_3.0_1664114693280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_anan0329', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_anan0329", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_anan0329|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Chinese Bert Embeddings (Base)
author: John Snow Labs
name: bert_embeddings_jdt_fin_roberta_wwm
date: 2022-04-11
tags: [bert, embeddings, zh, open_source]
task: Embeddings
language: zh
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `jdt-fin-roberta-wwm` is a Chinese model originally trained by `wangfan`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_jdt_fin_roberta_wwm_zh_3.4.2_3.0_1649669984329.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_jdt_fin_roberta_wwm_zh_3.4.2_3.0_1649669984329.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_jdt_fin_roberta_wwm","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_jdt_fin_roberta_wwm","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.embed.jdt_fin_roberta_wwm").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_jdt_fin_roberta_wwm|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|383.7 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/wangfan/jdt-fin-roberta-wwm
- https://3.cn/103c-hwSS
- https://3.cn/103c-izpe
---
layout: model
title: English image_classifier_vit_llama_alpaca_guanaco_vicuna ViTForImageClassification from osanseviero
author: John Snow Labs
name: image_classifier_vit_llama_alpaca_guanaco_vicuna
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_llama_alpaca_guanaco_vicuna` is an English model originally trained by osanseviero.
## Predicted Entities
`alpaca`, `guanaco`, `llama`, `vicuna`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_llama_alpaca_guanaco_vicuna_en_4.1.0_3.0_1660166270042.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_llama_alpaca_guanaco_vicuna_en_4.1.0_3.0_1660166270042.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_llama_alpaca_guanaco_vicuna", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_llama_alpaca_guanaco_vicuna", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_llama_alpaca_guanaco_vicuna|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Legal Sublease Agreement Document Classifier (Longformer)
author: John Snow Labs
name: legclf_sublease_agreement
date: 2022-11-10
tags: [en, legal, classification, agreement, sublease, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_sublease_agreement` model is a Legal Longformer Document Classifier that predicts whether a document belongs to the class `sublease-agreement` or not (binary classification).
Longformer models are limited to 4096 tokens, so only the first 4096 tokens of a document are taken into account. We have found that, for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document itself without any extra leading material, 4096 tokens are enough for document classification.
If that is not the case for your documents, let us know and we can apply a different approach for you: splitting each document into 4096-token chunks, embedding each chunk, averaging the embeddings, and training on the averaged vectors, which means the whole document is taken into account. In theory, however, this should not be necessary.
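The chunk-averaging fallback mentioned above can be sketched in a few lines of plain Python. This is a hypothetical illustration, not part of the model: `toy_embed` stands in for any 4096-token Longformer encoder.

```python
def average_chunk_embedding(tokens, embed, chunk_size=4096):
    """Split a long token sequence into fixed-size chunks, embed each
    chunk, and average the chunk vectors into one document vector."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    vectors = [embed(c) for c in chunks]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Toy stand-in encoder: a 2-dim "embedding" (mean token id, chunk length).
def toy_embed(chunk):
    return [sum(chunk) / len(chunk), float(len(chunk))]

doc = list(range(10_000))  # a "document" longer than 4096 tokens
vec = average_chunk_embedding(doc, toy_embed)  # 3 chunks: 4096 + 4096 + 1808 tokens
```

With a real encoder, `vec` would be the averaged document embedding used for training, so the whole document contributes rather than only the first 4096 tokens.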
## Predicted Entities
`sublease-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_sublease_agreement_en_1.0.0_3.0_1668117647287.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_sublease_agreement_en_1.0.0_3.0_1668117647287.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
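{% include programmingLanguageSelectScalaPythonNLU.html %}
This card omits its usage snippet. The sketch below follows the pattern of the other classifier cards in this hub; the `legal_longformer_base` embeddings name is an assumption, so substitute the embeddings this classifier was actually trained with.

```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
# Embeddings name is an assumption; use the embeddings this model was trained with.
embeddings = LongformerEmbeddings.pretrained("legal_longformer_base", "en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
sentence_embeddings = SentenceEmbeddings() \
.setInputCols(["document", "embeddings"]) \
.setOutputCol("sentence_embeddings") \
.setPoolingStrategy("AVERAGE")
doc_classifier = LegalClassifierDLModel.pretrained("legclf_sublease_agreement", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings, sentence_embeddings, doc_classifier])
data = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```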
## Results
```bash
+--------------------+
|              result|
+--------------------+
|[sublease-agreement]|
|             [other]|
|             [other]|
|[sublease-agreement]|
+--------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_sublease_agreement|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.2 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
other 0.99 1.00 0.99 66
sublease-agreement 1.00 0.97 0.99 35
accuracy - - 0.99 101
macro-avg 0.99 0.99 0.99 101
weighted-avg 0.99 0.99 0.99 101
```
---
layout: model
title: German Financial Bert Word Embeddings
author: John Snow Labs
name: bert_embeddings_german_financial_statements_bert
date: 2022-04-11
tags: [bert, embeddings, de, open_source]
task: Embeddings
language: de
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Financial Bert Word Embeddings model, trained on German Financial Statements. Uploaded to Hugging Face, adapted and imported into Spark NLP. `german-financial-statements-bert` is a German model originally trained by `fabianrausch`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_german_financial_statements_bert_de_3.4.2_3.0_1649676227862.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_german_financial_statements_bert_de_3.4.2_3.0_1649676227862.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_german_financial_statements_bert","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ich liebe Funken NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_german_financial_statements_bert","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ich liebe Funken NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.embed.german_financial_statements_bert").predict("""Ich liebe Funken NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_german_financial_statements_bert|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|de|
|Size:|409.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/fabianrausch/german-financial-statements-bert
---
layout: model
title: Sentence Entity Resolver for UMLS CUI Codes (Clinical Drug)
author: John Snow Labs
name: sbiobertresolve_umls_clinical_drugs
date: 2022-07-05
tags: [entity_resolution, licensed, clinical, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities to UMLS CUI codes. It is trained on the 2022AA UMLS dataset. The complete dataset has 127 different categories, and this model is trained on the Clinical Drug category using `sbiobert_base_cased_mli` embeddings.
## Predicted Entities
`Predicts UMLS codes for Clinical Drug medical concepts`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_clinical_drugs_en_4.0.0_3.0_1657039242193.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_clinical_drugs_en_4.0.0_3.0_1657039242193.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
document_assembler = DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("clinical_ner")
ner_model_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "clinical_ner"])\
.setOutputCol("ner_chunk")
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_umls_clinical_drugs","en", "clinical/models") \
.setInputCols(["ner_chunk", "sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
pipeline = Pipeline(stages = [document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_model_converter, chunk2doc, sbert_embedder, resolver])
data = spark.createDataFrame([["""She was immediately given hydrogen peroxide 30 mg to treat the infection on her leg, and has been advised Neosporin Cream for 5 days. She has a history of taking magnesium hydroxide 100mg/1ml and metformin 1000 mg."""]]).toDF("text")
results = pipeline.fit(data).transform(data)
```
```scala
...
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel
.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("clinical_ner")
val ner_model_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "clinical_ner"))
.setOutputCol("ner_chunk")
val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli", "en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
.setCaseSensitive(false)
val resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_umls_clinical_drugs", "en", "clinical/models")
.setInputCols(Array("ner_chunk_doc", "sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_model_converter, chunk2doc, sbert_embedder, resolver))
val data = Seq("She was immediately given hydrogen peroxide 30 mg to treat the infection on her leg, and has been advised Neosporin Cream for 5 days. She has a history of taking magnesium hydroxide 100mg/1ml and metformin 1000 mg.").toDF("text")
val res = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.umls_clinical_drugs").predict("""She was immediately given hydrogen peroxide 30 mg to treat the infection on her leg, and has been advised Neosporin Cream for 5 days. She has a history of taking magnesium hydroxide 100mg/1ml and metformin 1000 mg.""")
```
## Results
```bash
| | chunk | code | code_description | all_k_code_desc | all_k_codes |
|---:|:------------------------------|:---------|:---------------------------|:-------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 | hydrogen peroxide 30 mg | C1126248 | hydrogen peroxide 30 mg/ml | ['C1126248', 'C0304655', 'C1605252', 'C0304656', 'C1154260'] | ['hydrogen peroxide 30 mg/ml', 'hydrogen peroxide solution 30%', 'hydrogen peroxide 30 mg/ml [proxacol]', 'hydrogen peroxide 30 mg/ml cutaneous solution', 'benzoyl peroxide 30 mg/ml'] |
| 1 | Neosporin Cream | C0132149 | neosporin cream | ['C0132149', 'C0358174', 'C0357999', 'C0307085', 'C0698810'] | ['neosporin cream', 'nystan cream', 'nystadermal cream', 'nupercainal cream', 'nystaform cream'] |
| 2 | magnesium hydroxide 100mg/1ml | C1134402 | magnesium hydroxide 100 mg | ['C1134402', 'C1126785', 'C4317023', 'C4051486', 'C4047137'] | ['magnesium hydroxide 100 mg', 'magnesium hydroxide 100 mg/ml', 'magnesium sulphate 100mg/ml injection', 'magnesium sulfate 100 mg', 'magnesium sulfate 100 mg/ml'] |
| 3 | metformin 1000 mg | C0987664 | metformin 1000 mg | ['C0987664', 'C2719784', 'C0978482', 'C2719786', 'C4282269'] | ['metformin 1000 mg', 'metformin hydrochloride 1000 mg', 'metformin hcl 1000mg tab', 'metformin hydrochloride 1000 mg [fortamet]', 'metformin hcl 1000mg sa tab'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_umls_clinical_drugs|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[name]|
|Language:|en|
|Size:|2.5 GB|
|Case sensitive:|false|
## References
Trained on 2022AA UMLS dataset’s Clinical Drug category. https://www.nlm.nih.gov/research/umls/index.html
---
layout: model
title: Legal Other Definitional Provisions Clause Binary Classifier
author: John Snow Labs
name: legclf_other_definitional_provisions_clause
date: 2023-01-29
tags: [en, legal, classification, other, definitional, provisions, clauses, other_definitional_provisions, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `other-definitional-provisions` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip it unless you want to do binary classification at the sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting them using any of the techniques shown in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
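As a plain-Python illustration (a simplified stand-in for the splitting utilities covered in the tutorial linked above, not the library's own API), paragraph splitting by multiline can look like:

```python
def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on blank lines (multiline splitting)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# Each resulting chunk can then be classified separately,
# keeping every input within the model's token limit.
clauses = split_paragraphs("1. Definitions.\n\nOther Definitional Provisions.\n\n2. The Loan.")
```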
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clauses Classifiers available in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other-definitional-provisions`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_other_definitional_provisions_clause_en_1.0.0_3.0_1674993355901.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_other_definitional_provisions_clause_en_1.0.0_3.0_1674993355901.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
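The snippet below sketches the usual Legal NLP classification pipeline for this family of classifiers (a sketch, assuming a running Spark session with a licensed `johnsnowlabs` installation; the `sent_bert_base_cased` embeddings model and the `category` output column name follow the common convention for these cards but are assumptions here):

```python
# Sketch of a typical Legal NLP clause-classification pipeline
# (requires a licensed johnsnowlabs installation; not runnable standalone).
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_other_definitional_provisions_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```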
## Results
```bash
+-------+
|result|
+-------+
|[other-definitional-provisions]|
|[other]|
|[other]|
|[other-definitional-provisions]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_other_definitional_provisions_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.97 1.00 0.98 30
other-definitional-provisions 1.00 0.95 0.98 22
accuracy - - 0.98 52
macro-avg 0.98 0.98 0.98 52
weighted-avg 0.98 0.98 0.98 52
```
---
layout: model
title: Pipeline for Detect Medication
author: John Snow Labs
name: ner_medication_pipeline
date: 2023-06-13
tags: [ner, en, licensed]
task: Pipeline Healthcare
language: en
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A pretrained pipeline to detect medication entities. It was built on top of the `ner_posology_greedy` model and augmented with the drug names mentioned in the UK and US DrugBank datasets.
## Predicted Entities
`DRUG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_medication_pipeline_en_4.4.4_3.2_1686665836067.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_medication_pipeline_en_4.4.4_3.2_1686665836067.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
ner_medication_pipeline = PretrainedPipeline("ner_medication_pipeline", "en", "clinical/models")
text = """The patient was prescribed metformin 1000 MG, and glipizide 2.5 MG. The other patient was given Fragmin 5000 units, Xenaderm to wounds topically b.i.d. and OxyContin 30 mg."""
result = ner_medication_pipeline.fullAnnotate([text])
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val ner_medication_pipeline = new PretrainedPipeline("ner_medication_pipeline", "en", "clinical/models")
val result = ner_medication_pipeline.fullAnnotate("The patient was prescribed metformin 1000 MG, and glipizide 2.5 MG. The other patient was given Fragmin 5000 units, Xenaderm to wounds topically b.i.d. and OxyContin 30 mg.")
```
## Results
```bash
| ner_chunk          | entity   |
|:-------------------|:---------|
| metformin 1000 MG  | DRUG     |
| glipizide 2.5 MG   | DRUG     |
| Fragmin 5000 units | DRUG     |
| Xenaderm           | DRUG     |
| OxyContin 30 mg    | DRUG     |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_medication_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
- TextMatcherModel
- ChunkMergeModel
- Finisher
---
layout: model
title: Translate English to Finnish Pipeline
author: John Snow Labs
name: translate_en_fi
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, fi, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `fi`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_fi_xx_2.7.0_2.4_1609689441892.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_fi_xx_2.7.0_2.4_1609689441892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_fi", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_fi", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.fi').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_fi|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Stop Words Cleaner for Greek
author: John Snow Labs
name: stopwords_el
date: 2020-07-14 19:03:00 +0800
task: Stop Words Removal
language: el
edition: Spark NLP 2.5.4
spark_version: 2.4
tags: [stopwords, el]
supported: true
annotator: StopWordsCleaner
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_el_el_2.5.4_2.4_1594742437880.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_el_el_2.5.4_2.4_1594742437880.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
stop_words = StopWordsCleaner.pretrained("stopwords_el", "el") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Εκτός από το ότι είναι ο βασιλιάς του Βορρά, ο John Snow είναι Άγγλος γιατρός και ηγέτης στην ανάπτυξη της αναισθησίας και της ιατρικής υγιεινής.")
```
```scala
...
val stopWords = StopWordsCleaner.pretrained("stopwords_el", "el")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords))
val data = Seq("Εκτός από το ότι είναι ο βασιλιάς του Βορρά, ο John Snow είναι Άγγλος γιατρός και ηγέτης στην ανάπτυξη της αναισθησίας και της ιατρικής υγιεινής.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Εκτός από το ότι είναι ο βασιλιάς του Βορρά, ο John Snow είναι Άγγλος γιατρός και ηγέτης στην ανάπτυξη της αναισθησίας και της ιατρικής υγιεινής."""]
stopword_df = nlu.load('el.stopwords').predict(text)
stopword_df[["cleanTokens"]]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=4, result='Εκτός', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=6, end=8, result='από', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=13, end=15, result='ότι', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=17, end=21, result='είναι', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=25, end=32, result='βασιλιάς', metadata={'sentence': '0'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_el|
|Type:|stopwords|
|Compatibility:|Spark NLP 2.5.4+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|el|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)
---
layout: model
title: Financial Finbert Sentiment Analysis (DistilRoBerta)
author: John Snow Labs
name: finclf_distilroberta_sentiment_analysis
date: 2022-08-09
tags: [en, finance, sentiment, classification, sentiment_analysis, licensed]
task: Sentiment Analysis
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a pre-trained NLP model to analyze sentiment of financial text. It is built by further training the DistilRoBerta language model in the finance domain, using a financial corpus and thereby fine-tuning it for financial sentiment classification. Financial PhraseBank by Malo et al. (2014) and in-house JSL documents and annotations have been used for fine-tuning.
## Predicted Entities
`positive`, `negative`, `neutral`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN_FINANCE/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_distilroberta_sentiment_analysis_en_1.0.0_3.2_1660055192412.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_distilroberta_sentiment_analysis_en_1.0.0_3.2_1660055192412.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
classifier = nlp.RoBertaForSequenceClassification.pretrained("finclf_distilroberta_sentiment_analysis","en", "finance/models") \
.setInputCols(["document", "token"]) \
.setOutputCol("class")
nlpPipeline = nlp.Pipeline(
stages = [
documentAssembler,
tokenizer,
classifier])
# couple of simple examples
example = spark.createDataFrame([["Stocks rallied and the British pound gained."]]).toDF("text")
result = nlpPipeline.fit(example).transform(example)
# result is a DataFrame
result.select("text", "class.result").show()
```
## Results
```bash
+--------------------+----------+
| text| result|
+--------------------+----------+
|Stocks rallied an...|[positive]|
+--------------------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finclf_distilroberta_sentiment_analysis|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|309.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
In-house financial documents and Financial PhraseBank by Malo et al. (2014)
## Benchmarking
```bash
label precision recall f1-score support
positive 0.77 0.88 0.81 253
negative 0.86 0.85 0.88 133
neutral 0.93 0.86 0.90 584
accuracy - - 0.86 970
macro-avg 0.85 0.86 0.85 970
weighted-avg 0.87 0.86 0.87 970
```
---
layout: model
title: Portuguese BertForTokenClassification Cased model (from pucpr)
author: John Snow Labs
name: bert_token_classifier_clinicalnerpt_disease
date: 2022-11-30
tags: [pt, open_source, bert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: pt
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `clinicalnerpt-disease` is a Portuguese model originally trained by `pucpr`.
## Predicted Entities
`DiseaseOrSyndrome`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_clinicalnerpt_disease_pt_4.2.4_3.0_1669822418241.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_clinicalnerpt_disease_pt_4.2.4_3.0_1669822418241.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_clinicalnerpt_disease","pt") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_clinicalnerpt_disease","pt")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_clinicalnerpt_disease|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|pt|
|Size:|665.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/pucpr/clinicalnerpt-disease
- https://www.aclweb.org/anthology/2020.clinicalnlp-1.7/
- https://github.com/HAILab-PUCPR/SemClinBr
- https://github.com/HAILab-PUCPR/BioBERTpt
---
layout: model
title: Stop Words Cleaner for Marathi
author: John Snow Labs
name: stopwords_mr
date: 2020-07-14 19:03:00 +0800
task: Stop Words Removal
language: mr
edition: Spark NLP 2.5.4
spark_version: 2.4
tags: [stopwords, mr]
supported: true
annotator: StopWordsCleaner
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_mr_mr_2.5.4_2.4_1594742439994.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_mr_mr_2.5.4_2.4_1594742439994.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
stop_words = StopWordsCleaner.pretrained("stopwords_mr", "mr") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("उत्तरेचा राजा होण्याव्यतिरिक्त, जॉन स्नो एक इंग्रज चिकित्सक आहे आणि भूल आणि वैद्यकीय स्वच्छतेच्या विकासासाठी अग्रगण्य आहे.")
```
```scala
...
val stopWords = StopWordsCleaner.pretrained("stopwords_mr", "mr")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords))
val data = Seq("उत्तरेचा राजा होण्याव्यतिरिक्त, जॉन स्नो एक इंग्रज चिकित्सक आहे आणि भूल आणि वैद्यकीय स्वच्छतेच्या विकासासाठी अग्रगण्य आहे.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""उत्तरेचा राजा होण्याव्यतिरिक्त, जॉन स्नो एक इंग्रज चिकित्सक आहे आणि भूल आणि वैद्यकीय स्वच्छतेच्या विकासासाठी अग्रगण्य आहे."""]
stopword_df = nlu.load('mr.stopwords').predict(text)
stopword_df[['cleanTokens']]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=7, result='उत्तरेचा', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=9, end=12, result='राजा', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=14, end=29, result='होण्याव्यतिरिक्त', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=30, end=30, result=',', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=32, end=34, result='जॉन', metadata={'sentence': '0'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_mr|
|Type:|stopwords|
|Compatibility:|Spark NLP 2.5.4+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|mr|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)
---
layout: model
title: Translate Icelandic to English Pipeline
author: John Snow Labs
name: translate_is_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, is, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `is`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_is_en_xx_2.7.0_2.4_1609690970413.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_is_en_xx_2.7.0_2.4_1609690970413.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_is_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_is_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.is.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_is_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Translate English to Xhosa Pipeline
author: John Snow Labs
name: translate_en_xh
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, xh, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `xh`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_xh_xx_2.7.0_2.4_1609689615747.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_xh_xx_2.7.0_2.4_1609689615747.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_xh", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_xh", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.xh').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_xh|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from zhenyueyu)
author: John Snow Labs
name: distilbert_qa_zhenyueyu_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `zhenyueyu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_zhenyueyu_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773369365.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_zhenyueyu_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773369365.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_zhenyueyu_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_zhenyueyu_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_zhenyueyu_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/zhenyueyu/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Abkhazian asr_xls_test TFWav2Vec2ForCTC from pere
author: John Snow Labs
name: pipeline_asr_xls_test
date: 2022-09-24
tags: [wav2vec2, ab, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: ab
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_test` is an Abkhazian model originally trained by pere.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_xls_test_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_test_ab_4.2.0_3.0_1664020711203.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_test_ab_4.2.0_3.0_1664020711203.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_xls_test', lang = 'ab')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_xls_test", lang = "ab")
val annotations = pipeline.transform(audioDF)
```
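Both snippets above assume an `audioDF` whose `audio_content` column holds the audio samples as an array of floats. A minimal stdlib sketch for producing such values (a hypothetical helper, not part of Spark NLP; assumes a mono 16-bit PCM WAV file):

```python
import struct
import wave

def wav_to_floats(path_or_file):
    """Read a mono 16-bit PCM WAV and return its samples as floats in [-1, 1]."""
    with wave.open(path_or_file, "rb") as w:
        frames = w.readframes(w.getnframes())
    n = len(frames) // 2  # 2 bytes per 16-bit sample
    samples = struct.unpack("<%dh" % n, frames)
    return [s / 32768.0 for s in samples]

# e.g. audioDF = spark.createDataFrame([(wav_to_floats("sample.wav"),)], ["audio_content"])
```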
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_xls_test|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|ab|
|Size:|452.5 KB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Arabic Bert Embeddings (Base)
author: John Snow Labs
name: bert_embeddings_bert_base_qarib
date: 2022-04-11
tags: [bert, embeddings, ar, open_source]
task: Embeddings
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-qarib` is an Arabic model originally trained by `qarib`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_qarib_ar_3.4.2_3.0_1649677790858.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_qarib_ar_3.4.2_3.0_1649677790858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_qarib","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_qarib","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("أنا أحب شرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.embed.bert_base_qarib").predict("""أنا أحب شرارة NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_qarib|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|506.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/qarib/bert-base-qarib
- http://opus.nlpl.eu/
- https://github.com/qcri/QARIB/Training_QARiB.md
- https://github.com/qcri/QARIB/Using_QARiB.md
---
layout: model
title: Catalan, Valencian asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala TFWav2Vec2ForCTC from softcatala
author: John Snow Labs
name: asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala
date: 2022-09-24
tags: [wav2vec2, ca, audio, open_source, asr]
task: Automatic Speech Recognition
language: ca
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala` is a Catalan, Valencian model originally trained by softcatala.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala_gpu instead.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala_ca_4.2.0_3.0_1664037065825.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala_ca_4.2.0_3.0_1664037065825.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala", "ca")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala", "ca")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|ca|
|Size:|1.2 GB|
---
layout: model
title: Sango asr_wav2vec2_large_xlsr_53_swiss_german TFWav2Vec2ForCTC from Yves
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_53_swiss_german
date: 2022-09-24
tags: [wav2vec2, sg, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: sg
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_swiss_german` is a Sango model originally trained by Yves.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_swiss_german_gpu instead.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_swiss_german_sg_4.2.0_3.0_1664022719221.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_swiss_german_sg_4.2.0_3.0_1664022719221.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_swiss_german', lang = 'sg')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_swiss_german", lang = "sg")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_swiss_german|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|sg|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Danish XlmRoBertaForQuestionAnswering (from saattrupdan)
author: John Snow Labs
name: xlm_roberta_qa_xlmr_base_texas_squad_da_da_saattrupdan
date: 2022-06-24
tags: [da, open_source, question_answering, xlmroberta]
task: Question Answering
language: da
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlmr-base-texas-squad-da` is a Danish model originally trained by `saattrupdan`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_base_texas_squad_da_da_saattrupdan_da_4.0.0_3.0_1656062061104.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_base_texas_squad_da_da_saattrupdan_da_4.0.0_3.0_1656062061104.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlmr_base_texas_squad_da_da_saattrupdan","da") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlmr_base_texas_squad_da_da_saattrupdan","da")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("da.answer_question.squad.xlmr_roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlmr_base_texas_squad_da_da_saattrupdan|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|da|
|Size:|878.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/saattrupdan/xlmr-base-texas-squad-da
---
layout: model
title: Legal Construction Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_construction_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, construction, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Construction` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only individual sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in the Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
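For intuition, the paragraph splitting and 512-token chunking described above can be sketched in plain Python. This is an illustrative sketch only: the helper names are hypothetical, and the whitespace "tokens" are an approximation (real BERT subword counts will be higher, and production splitting should use the Legal NLP annotators from the tutorial linked above).

```python
import re

def split_paragraphs(text):
    # Blank lines (multiline) mark provision boundaries.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

def chunk_by_tokens(paragraph, max_tokens=512):
    # Naive whitespace "tokens"; BERT subword counts will be higher.
    tokens = paragraph.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

doc = ("Section 1. Construction. Headings are for convenience only.\n\n"
       "Section 2. Notices. All notices shall be in writing.")
provisions = split_paragraphs(doc)  # two candidate provisions to classify
```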
## Predicted Entities
`Construction`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_construction_bert_en_1.0.0_3.0_1678050533068.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_construction_bert_en_1.0.0_3.0_1678050533068.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+--------------+
|        result|
+--------------+
|[Construction]|
|       [Other]|
|       [Other]|
|[Construction]|
+--------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_construction_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Training dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Construction 0.84 0.83 0.84 46
Other 0.88 0.90 0.89 67
accuracy - - 0.87 113
macro-avg 0.86 0.86 0.86 113
weighted-avg 0.87 0.87 0.87 113
```
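As a sanity check, the support-weighted average in the benchmarking table follows directly from the per-class rows. A quick reproduction in plain Python (the inputs are the already-rounded per-class scores from the table, so small rounding discrepancies against the other rows are expected):

```python
# Per-class (f1, support) taken from the benchmarking table above.
per_class = {"Construction": (0.84, 46), "Other": (0.89, 67)}

support = sum(n for _, n in per_class.values())  # total evaluated provisions
weighted_f1 = sum(f1 * n for f1, n in per_class.values()) / support

print(support)                 # 113
print(round(weighted_f1, 2))   # 0.87, matching the weighted-avg row
```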
---
layout: model
title: English BertForTokenClassification Base Uncased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC4_Original_BiomedNLP_PubMedBERT_base_uncased_abstract
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4_Original-BiomedNLP-PubMedBERT-base-uncased-abstract` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities
`Chemical`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Original_BiomedNLP_PubMedBERT_base_uncased_abstract_en_4.0.0_3.0_1657109212870.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Original_BiomedNLP_PubMedBERT_base_uncased_abstract_en_4.0.0_3.0_1657109212870.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Original_BiomedNLP_PubMedBERT_base_uncased_abstract","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Original_BiomedNLP_PubMedBERT_base_uncased_abstract","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_BC4_Original_BiomedNLP_PubMedBERT_base_uncased_abstract|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|408.5 MB|
|Case sensitive:|false|
|Max sentence length:|128|
## References
- https://huggingface.co/ghadeermobasher/BC4_Original-BiomedNLP-PubMedBERT-base-uncased-abstract
---
layout: model
title: Multilingual DistilBertForTokenClassification Base Cased model (from mrm8488)
author: John Snow Labs
name: distilbert_ner_base_multi_cased_finetuned_typo_detection
date: 2022-07-21
tags: [open_source, distilbert, ner, typo, multilingual, xx]
task: Named Entity Recognition
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBERT NER model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-multi-cased-finetuned-typo-detection` is a Multilingual model originally trained by `mrm8488`.
## Predicted Entities
`ok`, `typo`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_base_multi_cased_finetuned_typo_detection_xx_4.0.0_3.0_1658399913400.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_base_multi_cased_finetuned_typo_detection_xx_4.0.0_3.0_1658399913400.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
ner = DistilBertForTokenClassification.pretrained("distilbert_ner_base_multi_cased_finetuned_typo_detection","xx") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, ner])
data = spark.createDataFrame([["PUT YOUR STRING HERE."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val ner = DistilBertForTokenClassification.pretrained("distilbert_ner_base_multi_cased_finetuned_typo_detection","xx")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, ner))
val data = Seq("PUT YOUR STRING HERE.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_ner_base_multi_cased_finetuned_typo_detection|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|xx|
|Size:|505.8 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
https://huggingface.co/mrm8488/distilbert-base-multi-cased-finetuned-typo-detection
---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_6
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-64-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_6_en_4.0.0_3.0_1655733561065.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_6_en_4.0.0_3.0_1655733561065.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_6","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_6","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_64d_seed_6").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_6|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|419.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-64-finetuned-squad-seed-6
---
layout: model
title: Multilingual DistilBertForQuestionAnswering Base Cased model (from monakth)
author: John Snow Labs
name: distilbert_qa_base_cased_squadv2
date: 2023-01-03
tags: [xx, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: xx
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
recommended: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-multilingual-cased-squadv2` is a Multilingual model originally trained by `monakth`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_squadv2_xx_4.3.0_3.0_1672767315694.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_squadv2_xx_4.3.0_3.0_1672767315694.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_squadv2","xx")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_squadv2","xx")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_cased_squadv2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|xx|
|Size:|505.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/monakth/distilbert-base-multilingual-cased-squadv2
---
layout: model
title: Fast Neural Machine Translation Model from Pijin to English
author: John Snow Labs
name: opus_mt_pis_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pis, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `pis`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_pis_en_xx_2.7.0_2.4_1609163443618.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_pis_en_xx_2.7.0_2.4_1609163443618.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_pis_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("PUT YOUR STRING HERE")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_pis_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.pis.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_pis_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Word2Vec Embeddings in Tagalog (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, tl, open_source]
task: Embeddings
language: tl
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_tl_3.4.1_3.0_1647461421317.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_tl_3.4.1_3.0_1647461421317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","tl") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Gustung-gusto ko ang Spark NLP."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","tl")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Gustung-gusto ko ang Spark NLP.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("tl.embed.w2v_cc_300d").predict("""Gustung-gusto ko ang Spark NLP.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|tl|
|Size:|416.3 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Legal Employment Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_employment_agreement_bert
date: 2022-11-24
tags: [en, legal, classification, agreement, employment, licensed, bert]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_employment_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `employment-agreement` or not (Binary Classification).
Unlike the Longformer model, this model is lighter in terms of inference time.
## Predicted Entities
`employment-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_employment_agreement_bert_en_1.0.0_3.0_1669310901974.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_employment_agreement_bert_en_1.0.0_3.0_1669310901974.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+----------------------+
|                result|
+----------------------+
|[employment-agreement]|
|               [other]|
|               [other]|
|[employment-agreement]|
+----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_employment_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
employment-agreement 0.96 0.90 0.93 29
other 0.96 0.99 0.98 82
accuracy - - 0.96 111
macro-avg 0.96 0.94 0.95 111
weighted-avg 0.96 0.96 0.96 111
```
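The macro and weighted averages in the table follow directly from the per-class scores and supports. As a quick check, the sketch below recomputes them from the rounded per-class f1 values above (the last digit may differ slightly from the report, which averages unrounded scores):

```python
# Recompute macro and weighted f1 from the (rounded) per-class values above.
f1 = {"employment-agreement": 0.93, "other": 0.98}
support = {"employment-agreement": 29, "other": 82}
total = sum(support.values())  # 111 documents

# Macro average: every class counts equally, regardless of support.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class weighted by its number of documents.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(round(macro_f1, 3), round(weighted_f1, 3))
```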
---
layout: model
title: Fast Neural Machine Translation Model from Latvian to English
author: John Snow Labs
name: opus_mt_lv_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, lv, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `lv`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_lv_en_xx_2.7.0_2.4_1609163952807.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_lv_en_xx_2.7.0_2.4_1609163952807.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_lv_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_lv_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.lv.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_lv_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForSequenceClassification Cased model (from Kaveh8)
author: John Snow Labs
name: roberta_classifier_autonlp_imdb_rating_625417974
date: 2022-12-09
tags: [en, open_source, roberta, sequence_classification, classification, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-imdb_rating-625417974` is an English model originally trained by `Kaveh8`.
## Predicted Entities
`1`, `4`, `3`, `2`, `5`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_autonlp_imdb_rating_625417974_en_4.2.4_3.0_1670622586146.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_autonlp_imdb_rating_625417974_en_4.2.4_3.0_1670622586146.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autonlp_imdb_rating_625417974","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier])
data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autonlp_imdb_rating_625417974","en")
.setInputCols(Array("document", "token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier))
val data = Seq("I love you!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_classifier_autonlp_imdb_rating_625417974|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|428.0 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/Kaveh8/autonlp-imdb_rating-625417974
---
layout: model
title: English asr_wav2vec2_large_xlsr_coraa_portuguese_cv8 TFWav2Vec2ForCTC from lgris
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_coraa_portuguese_cv8
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_coraa_portuguese_cv8` is an English model originally trained by lgris.
NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_coraa_portuguese_cv8_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_coraa_portuguese_cv8_en_4.2.0_3.0_1664043408372.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_coraa_portuguese_cv8_en_4.2.0_3.0_1664043408372.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_coraa_portuguese_cv8', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_coraa_portuguese_cv8", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_coraa_portuguese_cv8|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Chunk Resolver (Cpt Clinical)
author: John Snow Labs
name: chunkresolve_cpt_clinical
class: ChunkEntityResolverModel
language: en
nav_key: models
repository: clinical/models
date: 2020-04-21
task: Entity Resolution
edition: Healthcare NLP 2.4.2
spark_version: 2.4
tags: [clinical,licensed,entity_resolution,en]
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Entity resolution model based on KNN over word embeddings, using Word Mover's Distance.
## Predicted Entities
CPT codes and their normalized definitions, with `clinical_embeddings`.
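Conceptually, the resolver embeds each NER chunk and returns the CPT entry whose embedding is nearest. The toy sketch below illustrates just that nearest-neighbour step, with invented 3-dimensional vectors and plain Euclidean distance standing in for the actual clinical embeddings and Word Mover's Distance:

```python
import math

# Toy CPT "index": embedding -> (code, description).
# Vectors and the mapping are invented purely for illustration.
cpt_index = {
    (0.9, 0.1, 0.0): ("59514", "Cesarean delivery only;"),
    (0.1, 0.8, 0.2): ("31599", "Unlisted procedure, larynx"),
    (0.0, 0.2, 0.9): ("39501", "Repair, laceration of diaphragm, any approach"),
}

def resolve(chunk_embedding):
    """Return the CPT entry whose embedding is nearest to the chunk (1-NN)."""
    nearest = min(cpt_index, key=lambda v: math.dist(chunk_embedding, v))
    return cpt_index[nearest]

code, description = resolve((0.05, 0.25, 0.85))
```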
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_cpt_clinical_en_2.4.5_2.4_1587491373378.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_cpt_clinical_en_2.4.5_2.4_1587491373378.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPython.html %}
```python
...
cpt_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_cpt_clinical","en","clinical/models")\
.setInputCols("token","chunk_embeddings")\
.setOutputCol("entity")
pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, cpt_resolver])
data = spark.createDataFrame([["""The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion."""]]).toDF("text")
model = pipeline.fit(data)
results = model.transform(data)
```
```scala
...
val cpt_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_cpt_clinical","en","clinical/models")
.setInputCols(Array("token","chunk_embeddings"))
.setOutputCol("entity")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, cpt_resolver))
val data = Seq("The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
```bash
chunk entity cpt_description cpt_code
0 a cold, cough PROBLEM Thoracoscopy, surgical; with removal of a sing... 32669
1 runny nose PROBLEM Unlisted procedure, larynx 31599
2 fever PROBLEM Cesarean delivery only; 59514
3 difficulty breathing PROBLEM Repair, laceration of diaphragm, any approach 39501
4 her cough PROBLEM Exploration for postoperative hemorrhage, thro... 35840
5 physical exam TEST Cesarean delivery only; including postpartum care 59515
6 fairly congested PROBLEM Pyelotomy; with drainage, pyelostomy 50125
7 Amoxil TREATMENT Cholecystoenterostomy; with gastroenterostomy 47721
8 Aldex TREATMENT Laparoscopy, surgical; with omentopexy (omenta... 49326
9 difficulty breathing PROBLEM Repair, laceration of diaphragm, any approach 39501
10 more congested PROBLEM for section of 1 or more cranial nerves 61460
11 trouble sleeping PROBLEM Repair, laceration of diaphragm, any approach 39501
12 congestion PROBLEM Repair, laceration of diaphragm, any approach 39501
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|chunkresolve_cpt_clinical|
|Type:|ChunkEntityResolverModel|
|Compatibility:|Spark NLP 2.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[token, chunk_embeddings]|
|Output Labels:|[entity]|
|Language:|en|
|Case sensitive:|True|
|Dependencies:|embeddings_clinical|
{:.h2_title}
## Data Source
Trained on Current Procedural Terminology dataset.
---
layout: model
title: Pipeline to Detect details of cellular structures (biobert)
author: John Snow Labs
name: ner_cellular_biobert_pipeline
date: 2023-03-20
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_cellular_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_cellular_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cellular_biobert_pipeline_en_4.3.0_3.2_1679314449983.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cellular_biobert_pipeline_en_4.3.0_3.2_1679314449983.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_cellular_biobert_pipeline", "en", "clinical/models")
text = '''Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_cellular_biobert_pipeline", "en", "clinical/models")
val text = "Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.cellular_biobert.pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:--------------------------------------------|--------:|------:|:------------|-------------:|
| 0 | intracellular signaling proteins | 27 | 58 | protein | 0.673333 |
| 1 | human T-cell leukemia virus type 1 promoter | 130 | 172 | DNA | 0.426171 |
| 2 | Tax | 186 | 188 | protein | 0.779 |
| 3 | Tax-responsive element 1 | 193 | 216 | DNA | 0.756933 |
| 4 | cyclic AMP-responsive members | 237 | 265 | protein | 0.629333 |
| 5 | CREB/ATF family | 274 | 288 | protein | 0.8499 |
| 6 | transcription factors | 293 | 313 | protein | 0.78165 |
| 7 | Tax | 389 | 391 | protein | 0.8463 |
| 8 | Tax-responsive element 1 | 431 | 454 | DNA | 0.713067 |
| 9 | TRE-1 | 457 | 461 | DNA | 0.9983 |
| 10 | lacZ gene | 582 | 590 | DNA | 0.7018 |
| 11 | CYC1 promoter | 617 | 629 | DNA | 0.81865 |
| 12 | TRE-1 | 663 | 667 | DNA | 0.9967 |
| 13 | cyclic AMP response element-binding protein | 695 | 737 | protein | 0.51984 |
| 14 | CREB | 740 | 743 | protein | 0.9708 |
| 15 | CREB | 749 | 752 | protein | 0.8875 |
| 16 | GAL4 activation domain | 767 | 788 | protein | 0.578633 |
| 17 | GAD | 791 | 793 | protein | 0.6432 |
| 18 | reporter gene | 848 | 860 | DNA | 0.61005 |
| 19 | Tax | 863 | 865 | protein | 0.99 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_cellular_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.1 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Sentence Entity Resolver for ICD10-PCS (sbiobert_base_cased_mli embeddings)
author: John Snow Labs
name: sbiobertresolve_icd10pcs
date: 2021-05-16
tags: [entity_resolution, clinical, licensed, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.4
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to ICD10-PCS codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings, and has a faster load time, with a speedup of about 6x compared to previous versions. The load process is also more memory-friendly: the maximum memory required during loading is smaller, reducing the chance of OOM exceptions and thus relaxing hardware requirements.
## Predicted Entities
Predicts ICD10-PCS Codes and their normalized definitions.
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10pcs_en_3.0.4_3.0_1621189710474.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10pcs_en_3.0.4_3.0_1621189710474.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The `sbiobertresolve_icd10pcs` resolver model must be used with `sbiobert_base_cased_mli` as the embeddings and `ner_jsl` as the NER model, with `Procedure` set in `.setWhiteList()`.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
icd10pcs_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_icd10pcs","en", "clinical/models") \
.setInputCols(["ner_chunk", "sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10pcs_resolver])
data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
...
val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
val icd10pcs_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_icd10pcs","en", "clinical/models")
.setInputCols(Array("ner_chunk", "sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10pcs_resolver))
val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.icd10pcs").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""")
```
## Results
```bash
+--------------------+-----+---+---------+-------+----------+--------------------+--------------------+
| chunk|begin|end| entity| code|confidence| resolutions| codes|
+--------------------+-----+---+---------+-------+----------+--------------------+--------------------+
| hypertension| 68| 79| PROBLEM|DWY18ZZ| 0.0626|Hyperthermia of H...|DWY18ZZ:::6A3Z1ZZ...|
|chronic renal ins...| 83|109| PROBLEM|DTY17ZZ| 0.0722|Contact Radiation...|DTY17ZZ:::04593ZZ...|
| COPD| 113|116| PROBLEM|2W04X7Z| 0.0765|Change Intermitte...|2W04X7Z:::0J063ZZ...|
| gastritis| 120|128| PROBLEM|04723Z6| 0.0826|Dilation of Gastr...|04723Z6:::04724Z6...|
| TIA| 136|138| PROBLEM|00F5XZZ| 0.1074|Fragmentation in ...|00F5XZZ:::00F53ZZ...|
|a non-ST elevatio...| 182|202| PROBLEM|B307ZZZ| 0.0750|Plain Radiography...|B307ZZZ:::2W59X3Z...|
|Guaiac positive s...| 208|229| PROBLEM|3E1G38Z| 0.0886|Irrigation of Upp...|3E1G38Z:::3E1G38X...|
|cardiac catheteri...| 295|317| TEST|4A0234Z| 0.0783|Measurement of Ca...|4A0234Z:::4A02X4A...|
| PTCA| 324|327|TREATMENT|03SG3ZZ| 0.0507|Reposition Intrac...|03SG3ZZ:::0GCQ3ZZ...|
| mid LAD lesion| 332|345| PROBLEM|02H73DZ| 0.0490|Insertion of Intr...|02H73DZ:::02163Z7...|
+--------------------+-----+---+---------+-------+----------+--------------------+--------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_icd10pcs|
|Compatibility:|Healthcare NLP 3.0.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk, sbert_embeddings]|
|Output Labels:|[icd10pcs_code]|
|Language:|en|
|Case sensitive:|false|
## Data Source
Trained on ICD10 Procedure Coding System dataset with ``sbiobert_base_cased_mli`` sentence embeddings.
https://www.icd10data.com/ICD10PCS/Codes
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from nouamanetazi)
author: John Snow Labs
name: t5_cover_letter_base
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cover-letter-t5-base` is an English model originally trained by `nouamanetazi`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_cover_letter_base_en_4.3.0_3.0_1675100617268.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_cover_letter_base_en_4.3.0_3.0_1675100617268.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_cover_letter_base","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_cover_letter_base","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_cover_letter_base|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|910.7 MB|
## References
- https://huggingface.co/nouamanetazi/cover-letter-t5-base
---
layout: model
title: Fast Neural Machine Translation Model from Artificial languages to English
author: John Snow Labs
name: opus_mt_art_en
date: 2021-06-01
tags: [open_source, seq2seq, translation, art, en, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `art`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_art_en_xx_3.1.0_2.4_1622559545730.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_art_en_xx_3.1.0_2.4_1622559545730.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_art_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_art_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Artificial languages.translate_to.English').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_art_en|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_16_finetuned_squad_seed_6
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_6_en_4.3.0_3.0_1674214482569.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_6_en_4.3.0_3.0_1674214482569.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_6","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_6","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_16_finetuned_squad_seed_6|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|416.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-6
---
layout: model
title: Explain Document pipeline for Dutch (explain_document_lg)
author: John Snow Labs
name: explain_document_lg
date: 2021-03-23
tags: [open_source, dutch, explain_document_lg, pipeline, nl]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: nl
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The explain_document_lg is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps and recognizes entities.
It performs most of the common text processing tasks on your dataframe.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_nl_3.0.0_3.0_1616513098571.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_lg_nl_3.0.0_3.0_1616513098571.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('explain_document_lg', lang = 'nl')
annotations = pipeline.fullAnnotate("Hallo van John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_lg", lang = "nl")
val result = pipeline.fullAnnotate("Hallo van John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hallo van John Snow Labs! "]
result_df = nlu.load('nl.explain.lg').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | lemma | pos | embeddings | ner | entities |
|---:|:-------------------------------|:------------------------------|:------------------------------------------|:------------------------------------------|:--------------------------------------------|:-----------------------------|:------------------------------------------|:-----------------------------|
| 0 | ['Hallo van John Snow Labs! '] | ['Hallo van John Snow Labs!'] | ['Hallo', 'van', 'John', 'Snow', 'Labs!'] | ['Hallo', 'van', 'John', 'Snow', 'Labs!'] | ['PROPN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[-0.245989993214607,.,...]] | ['B-PER', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['Hallo', 'John Snow Labs!'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_document_lg|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|nl|
---
layout: model
title: Legal Support Clause Binary Classifier
author: John Snow Labs
name: legclf_support_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `support` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level.
If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
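As a library-independent illustration in plain Python (not a Legal NLP API), the first technique above, paragraph splitting by multiline, can be sketched as:

```python
import re

def split_paragraphs(text):
    # Split on one or more blank lines and drop empty fragments
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "1. TERM.\nThis Agreement begins on...\n\n2. SUPPORT.\nVendor shall provide support..."
print(split_paragraphs(doc))
# ['1. TERM.\nThis Agreement begins on...', '2. SUPPORT.\nVendor shall provide support...']
```

Each resulting paragraph can then be fed to the classifier as an independent document.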
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `support`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_support_clause_en_1.0.0_3.2_1660123058175.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_support_clause_en_1.0.0_3.2_1660123058175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
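This card does not include a pipeline snippet; below is a minimal usage sketch following the pattern of similar Legal NLP classifier cards. The sentence-embeddings model (`sent_bert_base_cased`) and the input column name are assumptions, not taken from this card:

```python
# Sketch only: assumes the licensed johnsnowlabs Legal NLP library and a running Spark session.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("clause_text")\
    .setOutputCol("document")

# The embeddings model here is an assumption; the card only specifies the
# "sentence_embeddings" input label.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_support_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text")
result = pipeline.fit(df).transform(df)
```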
## Results
```bash
+---------+
|   result|
+---------+
|[support]|
|  [other]|
|  [other]|
|[support]|
+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_support_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.97 0.99 0.98 120
support 0.97 0.89 0.93 35
accuracy - - 0.97 155
macro-avg 0.97 0.94 0.95 155
weighted-avg 0.97 0.97 0.97 155
```
---
layout: model
title: Legal Terminations Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_terminations_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, terminations, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Terminations` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level.
If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Terminations`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_terminations_bert_en_1.0.0_3.0_1678050545103.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_terminations_bert_en_1.0.0_3.0_1678050545103.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
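This card does not include a pipeline snippet; below is a minimal usage sketch following the pattern of similar Legal NLP classifier cards. The sentence-embeddings model (`sent_bert_base_cased`) and the input column name are assumptions, not taken from this card:

```python
# Sketch only: assumes the licensed johnsnowlabs Legal NLP library and a running Spark session.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("clause_text")\
    .setOutputCol("document")

# The embeddings model here is an assumption; the card only specifies the
# "sentence_embeddings" input label.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_terminations_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text")
result = pipeline.fit(df).transform(df)
```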
## Results
```bash
+--------------+
|        result|
+--------------+
|[Terminations]|
|       [Other]|
|       [Other]|
|[Terminations]|
+--------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_terminations_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 0.96 0.89 0.93 160
Terminations 0.88 0.95 0.91 129
accuracy - - 0.92 289
macro-avg 0.92 0.92 0.92 289
weighted-avg 0.92 0.92 0.92 289
```
---
layout: model
title: SNOMED Sentence Resolver (Spanish)
author: John Snow Labs
name: robertaresolve_snomed
date: 2021-11-03
tags: [embeddings, es, snomed, entity_resolution, clinical, licensed]
task: Entity Resolution
language: es
edition: Healthcare NLP 3.3.0
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps pre-detected NER chunks (from a `MedicalNerModel`, a `ChunkConverter` and a `Chunk2Doc`) to SNOMED terms and codes for the Spanish version of SNOMED. It requires Roberta Clinical Word Embeddings (`roberta_base_biomedical_es`) averaged with `SentenceEmbeddings`.
## Predicted Entities
`SNOMED codes`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/robertaresolve_snomed_es_3.3.0_3.0_1635933551478.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/robertaresolve_snomed_es_3.3.0_3.0_1635933551478.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use any `MedicalNer` model from our Models Hub that detects, for example, diagnoses for Spanish. Then use a `NerConverter` (in case your model uses B-I-O notation). Create documents using `Chunk2Doc`, then use a `Tokenizer` to split the chunk, and finally use the `roberta_base_biomedical_es` Roberta Embeddings model and a `SentenceEmbeddings` annotator with an average pooling strategy, as in the example.
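To make the final averaging step concrete, this small plain-Python sketch (not the Spark NLP API) shows what the AVERAGE pooling strategy computes over a chunk's token vectors:

```python
def average_pool(word_vectors):
    # One sentence vector: the per-dimension mean over all token vectors
    n = len(word_vectors)
    dims = len(word_vectors[0])
    return [sum(vec[d] for vec in word_vectors) / n for d in range(dims)]

# Two 2-dimensional token vectors for a single chunk
tokens = [[1.0, 2.0], [3.0, 4.0]]
print(average_pool(tokens))  # [2.0, 3.0]
```

The resolver then matches this single pooled vector against the SNOMED embedding space.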
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
c2doc = nlp.Chunk2Doc() \
.setInputCols("ner_chunk") \
.setOutputCol("sentence")
chunk_tokenizer = nlp.Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
chunk_word_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\
.setInputCols(["sentence", "token"])\
.setOutputCol("ner_chunk_word_embeddings")
chunk_embeddings = nlp.SentenceEmbeddings() \
.setInputCols(["sentence", "ner_chunk_word_embeddings"]) \
.setOutputCol("ner_chunk_embeddings") \
.setPoolingStrategy("AVERAGE")
er = medical.SentenceEntityResolverModel.pretrained("robertaresolve_snomed", "es", "clinical/models")\
.setInputCols(["sentence", "ner_chunk_embeddings"]) \
.setOutputCol("snomed_code") \
.setDistanceFunction("EUCLIDEAN")
snomed_resolve_pipeline = Pipeline(stages = [
c2doc,
chunk_tokenizer,
chunk_word_embeddings,
chunk_embeddings,
er
])
empty = spark.createDataFrame([['']]).toDF("text")
p_model = snomed_resolve_pipeline.fit(empty)
test_sentence = """Mujer de 28 años con antecedentes de diabetes mellitus gestacional diagnosticada ocho años antes de la presentación y posterior diabetes mellitus tipo dos (DM2), un episodio previo de pancreatitis inducida por HTG tres años antes de la presentación, asociado con una hepatitis aguda, y obesidad con un índice de masa corporal (IMC) de 33,5 kg / m2, que se presentó con antecedentes de una semana de poliuria, polidipsia, falta de apetito y vómitos. Dos semanas antes de la presentación, fue tratada con un ciclo de cinco días de amoxicilina por una infección del tracto respiratorio. Estaba tomando metformina, glipizida y dapagliflozina para la DM2 y atorvastatina y gemfibrozil para la HTG. Había estado tomando dapagliflozina durante seis meses en el momento de la presentación. El examen físico al momento de la presentación fue significativo para la mucosa oral seca; significativamente, su examen abdominal fue benigno sin dolor a la palpación, protección o rigidez. Los hallazgos de laboratorio pertinentes al ingreso fueron: glucosa sérica 111 mg / dl, bicarbonato 18 mmol / l, anión gap 20, creatinina 0,4 mg / dl, triglicéridos 508 mg / dl, colesterol total 122 mg / dl, hemoglobina glucosilada (HbA1c) 10%. y pH venoso 7,27. La lipasa sérica fue normal a 43 U / L. Los niveles séricos de acetona no pudieron evaluarse ya que las muestras de sangre se mantuvieron hemolizadas debido a una lipemia significativa. La paciente ingresó inicialmente por cetosis por inanición, ya que refirió una ingesta oral deficiente durante los tres días previos a la admisión. Sin embargo, la química sérica obtenida seis horas después de la presentación reveló que su glucosa era de 186 mg / dL, la brecha aniónica todavía estaba elevada a 21, el bicarbonato sérico era de 16 mmol / L, el nivel de triglicéridos alcanzó un máximo de 2050 mg / dL y la lipasa fue de 52 U / L. 
Se obtuvo el nivel de β-hidroxibutirato y se encontró que estaba elevado a 5,29 mmol / L; la muestra original se centrifugó y la capa de quilomicrones se eliminó antes del análisis debido a la interferencia de la turbidez causada por la lipemia nuevamente. El paciente fue tratado con un goteo de insulina para euDKA y HTG con una reducción de la brecha aniónica a 13 y triglicéridos a 1400 mg / dL, dentro de las 24 horas. Se pensó que su euDKA fue precipitada por su infección del tracto respiratorio en el contexto del uso del inhibidor de SGLT2. La paciente fue atendida por el servicio de endocrinología y fue dada de alta con 40 unidades de insulina glargina por la noche, 12 unidades de insulina lispro con las comidas y metformina 1000 mg dos veces al día. Se determinó que todos los inhibidores de SGLT2 deben suspenderse indefinidamente. Tuvo un seguimiento estrecho con endocrinología post alta."""
result = p_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]})))
```
```scala
...
val c2doc = new Chunk2Doc()
.setInputCols(Array("ner_chunk"))
.setOutputCol("sentence")
val chunk_tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val chunk_word_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner_chunk_word_embeddings")
val chunk_embeddings = new SentenceEmbeddings()
.setInputCols(Array("sentence", "ner_chunk_word_embeddings"))
.setOutputCol("ner_chunk_embeddings")
.setPoolingStrategy("AVERAGE")
val er = SentenceEntityResolverModel.pretrained("robertaresolve_snomed", "es", "clinical/models")
.setInputCols(Array("ner_chunk_embeddings"))
.setOutputCol("snomed_code")
.setDistanceFunction("EUCLIDEAN")
val snomed_pipeline = new Pipeline().setStages(Array(
c2doc,
chunk_tokenizer,
chunk_word_embeddings,
chunk_embeddings,
er))
val test_sentence = """Mujer de 28 años con antecedentes de diabetes mellitus gestacional diagnosticada ocho años antes de la presentación y posterior diabetes mellitus tipo dos (DM2), un episodio previo de pancreatitis inducida por HTG tres años antes de la presentación, asociado con una hepatitis aguda, y obesidad con un índice de masa corporal (IMC) de 33,5 kg / m2, que se presentó con antecedentes de una semana de poliuria, polidipsia, falta de apetito y vómitos. Dos semanas antes de la presentación, fue tratada con un ciclo de cinco días de amoxicilina por una infección del tracto respiratorio. Estaba tomando metformina, glipizida y dapagliflozina para la DM2 y atorvastatina y gemfibrozil para la HTG. Había estado tomando dapagliflozina durante seis meses en el momento de la presentación. El examen físico al momento de la presentación fue significativo para la mucosa oral seca; significativamente, su examen abdominal fue benigno sin dolor a la palpación, protección o rigidez. Los hallazgos de laboratorio pertinentes al ingreso fueron: glucosa sérica 111 mg / dl, bicarbonato 18 mmol / l, anión gap 20, creatinina 0,4 mg / dl, triglicéridos 508 mg / dl, colesterol total 122 mg / dl, hemoglobina glucosilada (HbA1c) 10%. y pH venoso 7,27. La lipasa sérica fue normal a 43 U / L. Los niveles séricos de acetona no pudieron evaluarse ya que las muestras de sangre se mantuvieron hemolizadas debido a una lipemia significativa. La paciente ingresó inicialmente por cetosis por inanición, ya que refirió una ingesta oral deficiente durante los tres días previos a la admisión. Sin embargo, la química sérica obtenida seis horas después de la presentación reveló que su glucosa era de 186 mg / dL, la brecha aniónica todavía estaba elevada a 21, el bicarbonato sérico era de 16 mmol / L, el nivel de triglicéridos alcanzó un máximo de 2050 mg / dL y la lipasa fue de 52 U / L. 
Se obtuvo el nivel de β-hidroxibutirato y se encontró que estaba elevado a 5,29 mmol / L; la muestra original se centrifugó y la capa de quilomicrones se eliminó antes del análisis debido a la interferencia de la turbidez causada por la lipemia nuevamente. El paciente fue tratado con un goteo de insulina para euDKA y HTG con una reducción de la brecha aniónica a 13 y triglicéridos a 1400 mg / dL, dentro de las 24 horas. Se pensó que su euDKA fue precipitada por su infección del tracto respiratorio en el contexto del uso del inhibidor de SGLT2. La paciente fue atendida por el servicio de endocrinología y fue dada de alta con 40 unidades de insulina glargina por la noche, 12 unidades de insulina lispro con las comidas y metformina 1000 mg dos veces al día. Se determinó que todos los inhibidores de SGLT2 deben suspenderse indefinidamente. Tuvo un seguimiento estrecho con endocrinología post alta."""
val data = Seq(test_sentence).toDF("text")
val result = snomed_pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.resolve.snomed").predict("""Mujer de 28 años con antecedentes de diabetes mellitus gestacional diagnosticada ocho años antes de la presentación y posterior diabetes mellitus tipo dos (DM2), un episodio previo de pancreatitis inducida por HTG tres años antes de la presentación, asociado con una hepatitis aguda, y obesidad con un índice de masa corporal (IMC) de 33,5 kg / m2, que se presentó con antecedentes de una semana de poliuria, polidipsia, falta de apetito y vómitos. Dos semanas antes de la presentación, fue tratada con un ciclo de cinco días de amoxicilina por una infección del tracto respiratorio. Estaba tomando metformina, glipizida y dapagliflozina para la DM2 y atorvastatina y gemfibrozil para la HTG. Había estado tomando dapagliflozina durante seis meses en el momento de la presentación. El examen físico al momento de la presentación fue significativo para la mucosa oral seca; significativamente, su examen abdominal fue benigno sin dolor a la palpación, protección o rigidez. Los hallazgos de laboratorio pertinentes al ingreso fueron: glucosa sérica 111 mg / dl, bicarbonato 18 mmol / l, anión gap 20, creatinina 0,4 mg / dl, triglicéridos 508 mg / dl, colesterol total 122 mg / dl, hemoglobina glucosilada (HbA1c) 10%. y pH venoso 7,27. La lipasa sérica fue normal a 43 U / L. Los niveles séricos de acetona no pudieron evaluarse ya que las muestras de sangre se mantuvieron hemolizadas debido a una lipemia significativa. La paciente ingresó inicialmente por cetosis por inanición, ya que refirió una ingesta oral deficiente durante los tres días previos a la admisión. Sin embargo, la química sérica obtenida seis horas después de la presentación reveló que su glucosa era de 186 mg / dL, la brecha aniónica todavía estaba elevada a 21, el bicarbonato sérico era de 16 mmol / L, el nivel de triglicéridos alcanzó un máximo de 2050 mg / dL y la lipasa fue de 52 U / L. 
Se obtuvo el nivel de β-hidroxibutirato y se encontró que estaba elevado a 5,29 mmol / L; la muestra original se centrifugó y la capa de quilomicrones se eliminó antes del análisis debido a la interferencia de la turbidez causada por la lipemia nuevamente. El paciente fue tratado con un goteo de insulina para euDKA y HTG con una reducción de la brecha aniónica a 13 y triglicéridos a 1400 mg / dL, dentro de las 24 horas. Se pensó que su euDKA fue precipitada por su infección del tracto respiratorio en el contexto del uso del inhibidor de SGLT2. La paciente fue atendida por el servicio de endocrinología y fue dada de alta con 40 unidades de insulina glargina por la noche, 12 unidades de insulina lispro con las comidas y metformina 1000 mg dos veces al día. Se determinó que todos los inhibidores de SGLT2 deben suspenderse indefinidamente. Tuvo un seguimiento estrecho con endocrinología post alta.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
model = WordEmbeddingsModel.pretrained("embeddings_healthcare","en","clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("word_embeddings")
```
```scala
val model = WordEmbeddingsModel.pretrained("embeddings_healthcare","en","clinical/models")
.setInputCols("document","token")
.setOutputCol("word_embeddings")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.glove.healthcare").predict("""Put your text here.""")
```
{:.h2_title}
## Results
Word2Vec feature vectors based on `embeddings_healthcare`.
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|embeddings_healthcare|
|Type:|WordEmbeddingsModel|
|Compatibility:|Spark NLP 2.4.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[word_embeddings]|
|Language:|en|
|Dimension:|400.0|
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from ThaisBeham)
author: John Snow Labs
name: distilbert_qa_base_uncased_finetuned_fira
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-fira` is an English model originally trained by `ThaisBeham`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_fira_en_4.3.0_3.0_1672767957708.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_fira_en_4.3.0_3.0_1672767957708.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_fira","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_fira","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_finetuned_fira|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/ThaisBeham/distilbert-base-uncased-finetuned-fira
---
layout: model
title: English Named Entity Recognition (from elastic)
author: John Snow Labs
name: distilbert_ner_distilbert_base_cased_finetuned_conll03_english
date: 2022-05-16
tags: [distilbert, ner, token_classification, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-finetuned-conll03-english` is an English model originally trained by `elastic`.
## Predicted Entities
`ORG`, `MISC`, `PER`, `LOC`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_distilbert_base_cased_finetuned_conll03_english_en_3.4.2_3.0_1652721683253.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_distilbert_base_cased_finetuned_conll03_english_en_3.4.2_3.0_1652721683253.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_distilbert_base_cased_finetuned_conll03_english","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_distilbert_base_cased_finetuned_conll03_english","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_ner_distilbert_base_cased_finetuned_conll03_english|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/elastic/distilbert-base-cased-finetuned-conll03-english
---
layout: model
title: English DistilBertForQuestionAnswering model (from V3RX2000)
author: John Snow Labs
name: distilbert_qa_V3RX2000_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `V3RX2000`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_V3RX2000_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724840806.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_V3RX2000_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724840806.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_V3RX2000_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_V3RX2000_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_V3RX2000").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
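The NLU one-liner above packs the question and its context into a single string separated by `|||`. A minimal pure-Python sketch of that packing convention (the helper name `pack_qa` is illustrative, not part of the NLU API):

```python
def pack_qa(question: str, context: str) -> str:
    """Join a question and its context with NLU's '|||' separator."""
    return f"{question}|||{context}"

packed = pack_qa("What is my name?", "My name is Clara and I live in Berkeley.")
print(packed)  # What is my name?|||My name is Clara and I live in Berkeley.
```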
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_V3RX2000_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/V3RX2000/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English asr_wav2vec2_large_960h_lv60_self TFWav2Vec2ForCTC from facebook
author: John Snow Labs
name: asr_wav2vec2_large_960h_lv60_self
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h_lv60_self` is an English model originally trained by facebook.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_960h_lv60_self_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_lv60_self_en_4.2.0_3.0_1664036965244.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_lv60_self_en_4.2.0_3.0_1664036965244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_960h_lv60_self", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_960h_lv60_self", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
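Both snippets assume `audioDf` already holds raw audio samples as floating-point values. If your source audio is 16-bit PCM, one common preparation step (an illustrative sketch, not part of the Spark NLP API) is to normalize the integer samples into the [-1.0, 1.0] range before building the DataFrame:

```python
def pcm16_to_floats(samples):
    """Normalize signed 16-bit PCM samples to floats in [-1.0, 1.0]."""
    return [s / 32768.0 for s in samples]

floats = pcm16_to_floats([0, 16384, -32768])
print(floats)  # [0.0, 0.5, -1.0]
```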
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_960h_lv60_self|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|757.3 MB|
---
layout: model
title: Explain Document Pipeline - CARP
author: John Snow Labs
name: explain_clinical_doc_carp
date: 2020-08-19
task: [Named Entity Recognition, Assertion Status, Relation Extraction, Pipeline Healthcare]
language: en
nav_key: models
edition: Healthcare NLP 2.5.5
spark_version: 2.4
tags: [pipeline, en, clinical, licensed]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A pretrained pipeline with ``ner_clinical``, ``assertion_dl``, ``re_clinical`` and ``ner_posology``. It will extract clinical and medication entities, assign assertion status and find relationships between clinical entities.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_carp_en_2.5.5_2.4_1597841630062.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_carp_en_2.5.5_2.4_1597841630062.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
carp_pipeline = PretrainedPipeline("explain_clinical_doc_carp","en","clinical/models")
annotations = carp_pipeline.fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting. She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals.""")[0]
annotations.keys()
```
```scala
val carp_pipeline = new PretrainedPipeline("explain_clinical_doc_carp","en","clinical/models")
val result = carp_pipeline.fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting. She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals.""")(0)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.explain_doc.carp").predict("""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting. She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals.""")
```
{:.h2_title}
## Results
This pretrained pipeline gives the result of `ner_clinical`, `re_clinical`, `ner_posology` and `assertion_dl` models.
```bash
| | chunks | ner_clinical | assertion | posology_chunk | ner_posology | relations |
|---|-------------------------------|--------------|-----------|------------------|--------------|-----------|
| 0 | gestational diabetes mellitus | PROBLEM | present | metformin | Drug | TrAP |
| 1 | metformin | TREATMENT | present | 1000 mg | Strength | TrCP |
| 2 | polyuria | PROBLEM | present | two times a day | Frequency | TrCP |
| 3 | polydipsia | PROBLEM | present | 40 units | Dosage | TrWP |
| 4 | poor appetite | PROBLEM | present | insulin glargine | Drug | TrCP |
| 5 | vomiting | PROBLEM | present | at night | Frequency | TrAP |
| 6 | insulin glargine | TREATMENT | present | 12 units | Dosage | TrAP |
```
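The relation codes in the last column follow the i2b2-style clinical relation scheme used by `re_clinical`. The glosses below are informal summaries for reading the output, not official label definitions:

```python
# Informal glosses for i2b2-style relation labels (illustrative, not official).
RELATION_GLOSS = {
    "TrAP": "treatment administered for problem",
    "TrCP": "treatment causes problem",
    "TrIP": "treatment improves problem",
    "TrWP": "treatment worsens problem",
    "TrNAP": "treatment not administered because of problem",
    "TeRP": "test reveals problem",
    "TeCP": "test conducted to investigate problem",
    "PIP": "problem indicates problem",
}

print(RELATION_GLOSS["TrAP"])  # treatment administered for problem
```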
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_clinical_doc_carp|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.5.5|
|License:|Licensed|
|Edition:|Official|
|Language:|[en]|
{:.h2_title}
## Included Models
- ner_clinical
- assertion_dl
- re_clinical
- ner_posology
---
layout: model
title: Multilingual XLMRobertaForTokenClassification Base Cased model (from moghis)
author: John Snow Labs
name: xlmroberta_ner_base_finetuned_panx
date: 2022-08-14
tags: [de, fr, open_source, xlm_roberta, ner, xx]
task: Named Entity Recognition
language: xx
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-fr-de` is a Multilingual model originally trained by `moghis`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_panx_xx_4.1.0_3.0_1660445027313.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_panx_xx_4.1.0_3.0_1660445027313.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_panx","xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_panx","xx")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|xx|
|Size:|858.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/moghis/xlm-roberta-base-finetuned-panx-fr-de
---
layout: model
title: Translate English to Germanic languages Pipeline
author: John Snow Labs
name: translate_en_gem
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, gem, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `gem`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_gem_xx_2.7.0_2.4_1609688259531.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_gem_xx_2.7.0_2.4_1609688259531.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_gem", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_gem", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.gem').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_gem|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Finnish BERT Embeddings (Base Cased)
author: John Snow Labs
name: bert_base_finnish_cased
date: 2022-01-03
tags: [open_source, embeddings, fi, bert]
task: Embeddings
language: fi
edition: Spark NLP 3.3.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A version of Google's BERT deep transfer learning model for Finnish. The model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks.
FinBERT features a custom 50,000-wordpiece vocabulary that has much better coverage of Finnish words than, for example, the previously released multilingual BERT models from Google.
FinBERT has been pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls. By contrast, Multilingual BERT was trained on Wikipedia texts, where the Finnish Wikipedia text is approximately 3% of the amount used to train FinBERT.
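The wordpiece vocabulary works like BERT's standard tokenizer: each word is split greedily into the longest matching subword units, with non-initial pieces prefixed by `##`. A toy sketch of that greedy longest-match scheme (the tiny vocabulary here is invented for illustration):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first wordpiece split.

    Returns the list of pieces, or None if the word cannot be covered
    by the vocabulary (BERT would emit [UNK] in that case).
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return None  # no piece matched at this position
        start = end
    return pieces

vocab = {"neuro", "##verkon"}
print(wordpiece("neuroverkon", vocab))  # ['neuro', '##verkon']
```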
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_finnish_cased_fi_3.3.4_2.4_1641223279447.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_finnish_cased_fi_3.3.4_2.4_1641223279447.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_base_finnish_cased", "fi") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
sample_data= spark.createDataFrame([['Syväoppiminen perustuu keinotekoisiin hermoihin, jotka muodostavat monikerroksisen neuroverkon.']], ["text"])
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(sample_data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_base_finnish_cased", "fi")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("Syväoppiminen perustuu keinotekoisiin hermoihin, jotka muodostavat monikerroksisen neuroverkon.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("fi.embed_sentence.bert.cased").predict("""Syväoppiminen perustuu keinotekoisiin hermoihin, jotka muodostavat monikerroksisen neuroverkon.""")
```
## Results
```bash
+--------------------+---------------+
| embeddings| token|
+--------------------+---------------+
|[0.53366333, -0.4...| Syväoppiminen|
|[0.49171034, -1.1...| perustuu|
|[-0.0017492473, -...| keinotekoisiin|
|[0.61259747, -0.7...| hermoihin|
|[-0.008151092, -0...| ,|
|[-0.4050159, -0.2...| jotka|
|[-0.69079936, 0.6...| muodostavat|
|[-0.45641452, 0.4...|monikerroksisen|
|[1.278124, -1.218...| neuroverkon|
|[0.42451048, -1.2...| .|
+--------------------+---------------+
```
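Each row above pairs a token with its contextual embedding vector (truncated for display). Vectors like these are typically compared with cosine similarity; a minimal self-contained sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```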
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_base_finnish_cased|
|Compatibility:|Spark NLP 3.3.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[bert]|
|Language:|fi|
|Size:|464.2 MB|
|Case sensitive:|true|
---
layout: model
title: Translate West Germanic languages to English Pipeline
author: John Snow Labs
name: translate_gmw_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, gmw, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `gmw`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_gmw_en_xx_2.7.0_2.4_1609685887934.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_gmw_en_xx_2.7.0_2.4_1609685887934.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_gmw_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_gmw_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.gmw.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_gmw_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223448939.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223448939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|460.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AnonymousSub/rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0
---
layout: model
title: English BertForMaskedLM Large Cased model
author: John Snow Labs
name: bert_embeddings_large_cased
date: 2022-12-02
tags: [en, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-cased` is an English model originally trained by HuggingFace.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_cased_en_4.2.4_3.0_1670020019196.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_cased_en_4.2.4_3.0_1670020019196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_cased","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_cased","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_large_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
## References
- https://huggingface.co/bert-large-cased
- https://arxiv.org/abs/1810.04805
- https://github.com/google-research/bert
- https://yknzhu.wixsite.com/mbweb
- https://en.wikipedia.org/wiki/English_Wikipedia
---
layout: model
title: English BertForQuestionAnswering model (from bdickson)
author: John Snow Labs
name: bert_qa_bdickson_bert_base_uncased_finetuned_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-squad` is an English model originally trained by `bdickson`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bdickson_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181090164.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bdickson_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181090164.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bdickson_bert_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bdickson_bert_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base_uncased.by_bdickson").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bdickson_bert_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/bdickson/bert-base-uncased-finetuned-squad
---
layout: model
title: COVID BERT Embeddings (Large Uncased)
author: John Snow Labs
name: covidbert_large_uncased
date: 2020-08-27
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
BERT-large-uncased model, pretrained on a corpus of messages from Twitter about COVID-19.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/covidbert_large_uncased_en_2.6.0_2.4_1598484981419.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/covidbert_large_uncased_en_2.6.0_2.4_1598484981419.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("covidbert_large_uncased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("covidbert_large_uncased", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.covidbert.large_uncased').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
en_embed_covidbert_large_uncased_embeddings token
[-1.934066891670227, 0.620597779750824, 0.0967... I
[-0.5530431866645813, 1.1948248147964478, -0.0... love
[0.255395770072937, 0.5808677077293396, 0.3073... NLP
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|covidbert_large_uncased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|1024|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from https://tfhub.dev/digitalepidemiologylab/covid-twitter-bert/2
---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_bert_base_uncased_few_shot_k_512_finetuned_squad_seed_0
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-512-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_512_finetuned_squad_seed_0_en_4.0.0_3.0_1654180865294.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_512_finetuned_squad_seed_0_en_4.0.0_3.0_1654180865294.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_few_shot_k_512_finetuned_squad_seed_0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_uncased_few_shot_k_512_finetuned_squad_seed_0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base_uncased_512d_seed_0").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_uncased_few_shot_k_512_finetuned_squad_seed_0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-512-finetuned-squad-seed-0
---
layout: model
title: Part of Speech for Persian
author: John Snow Labs
name: pos_ud_perdt
date: 2020-11-30
task: Part of Speech Tagging
language: fa
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [fa, pos]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_perdt_fa_2.7.0_2.4_1606724821106.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_perdt_fa_2.7.0_2.4_1606724821106.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
pos = PerceptronModel.pretrained("pos_ud_perdt", "fa") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate(["جان اسنو جدا از سلطنت شمال ، یک پزشک انگلیسی و رهبر توسعه بیهوشی و بهداشت پزشکی است."])
```
```scala
...
val pos = PerceptronModel.pretrained("pos_ud_perdt", "fa")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("جان اسنو جدا از سلطنت شمال ، یک پزشک انگلیسی و رهبر توسعه بیهوشی و بهداشت پزشکی است.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""جان اسنو جدا از سلطنت شمال ، یک پزشک انگلیسی و رهبر توسعه بیهوشی و بهداشت پزشکی است"""]
pos_df = nlu.load('fa.pos').predict(text)
pos_df
```
## Results
```bash
{'pos': [Annotation(pos, 0, 2, NOUN, {'word': 'جان'}),
Annotation(pos, 4, 7, NOUN, {'word': 'اسنو'}),
Annotation(pos, 9, 11, ADJ, {'word': 'جدا'}),
Annotation(pos, 13, 14, ADP, {'word': 'از'}),
Annotation(pos, 16, 20, NOUN, {'word': 'سلطنت'}),
Annotation(pos, 22, 25, NOUN, {'word': 'شمال'}),
Annotation(pos, 27, 27, PUNCT, {'word': '،'}),
Annotation(pos, 29, 30, NUM, {'word': 'یک'}),
Annotation(pos, 32, 35, NOUN, {'word': 'پزشک'}),
Annotation(pos, 37, 43, ADJ, {'word': 'انگلیسی'}),
Annotation(pos, 45, 45, CCONJ, {'word': 'و'}),
Annotation(pos, 47, 50, NOUN, {'word': 'رهبر'}),
Annotation(pos, 52, 56, NOUN, {'word': 'توسعه'}),
Annotation(pos, 58, 63, VERB, {'word': 'بیهوشی'}),
Annotation(pos, 65, 65, CCONJ, {'word': 'و'}),
Annotation(pos, 67, 72, NOUN, {'word': 'بهداشت'}),
Annotation(pos, 74, 78, ADJ, {'word': 'پزشکی'}),
Annotation(pos, 80, 82, AUX, {'word': 'است'}),
Annotation(pos, 83, 83, PUNCT, {'word': '.'})]}
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_perdt|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|fa|
## Data Source
The model is trained on data obtained from [https://universaldependencies.org](https://universaldependencies.org)
## Benchmarking
```bash
| | | precision | recall | f1-score | support |
|---:|:-------------|:------------|:---------|-----------:|----------:|
| 0 | ADJ | 0.88 | 0.88 | 0.88 | 1647 |
| 1 | ADP | 0.99 | 0.99 | 0.99 | 3402 |
| 2 | ADV | 0.94 | 0.91 | 0.92 | 383 |
| 3 | AUX | 0.99 | 0.99 | 0.99 | 1000 |
|  4 | CCONJ        | 1.00        | 1.00     |       1.00 |      1022 |
| 5 | DET | 0.94 | 0.96 | 0.95 | 490 |
| 6 | INTJ | 0.88 | 0.81 | 0.85 | 27 |
| 7 | NOUN | 0.95 | 0.96 | 0.95 | 8201 |
| 8 | NUM | 0.94 | 0.97 | 0.96 | 293 |
| 9 | None | 1.00 | 0.99 | 0.99 | 289 |
| 10 | PART | 1.00 | 0.86 | 0.92 | 28 |
| 11 | PRON | 0.98 | 0.97 | 0.98 | 1117 |
| 12 | PROPN | 0.84 | 0.78 | 0.81 | 1107 |
| 13 | PUNCT        | 1.00        | 1.00     |       1.00 |      2134 |
| 14 | SCONJ | 0.98 | 0.98 | 0.98 | 630 |
| 15 | VERB | 0.99 | 0.99 | 0.99 | 2581 |
| 16 | accuracy | | | 0.96 | 24351 |
| 17 | macro avg | 0.96 | 0.94 | 0.95 | 24351 |
| 18 | weighted avg | 0.96 | 0.96 | 0.96 | 24351 |
```
---
layout: model
title: Legal Rights Agreement Document Classifier (Longformer)
author: John Snow Labs
name: legclf_rights_agreement
date: 2022-11-24
tags: [en, legal, classification, agreement, rights, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_rights_agreement` model is a Legal Longformer Document Classifier that predicts whether a document belongs to the class `rights-agreement` or not (Binary Classification).
Longformers are limited to 4096 tokens, so only the first 4096 tokens of a document are taken into account. We have found that for the large majority of documents in legal corpora, as long as they are clean and contain only the legal document without any extra leading material, 4096 tokens are enough to perform Document Classification.
If that is not the case for your documents, let us know and we can take another approach for you: splitting each document into 4096-token chunks, averaging their embeddings, and training on the averaged version, which means the whole document is taken into account. In theory, however, this should not be required.
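The chunk-and-average strategy described above can be sketched in plain Python. This is a hypothetical illustration only, not the Spark NLP implementation; `embed_chunk` stands in for a real Longformer encoder.

```python
MAX_LEN = 4096  # Longformer's token limit

def chunk(tokens, max_len=MAX_LEN):
    """Split a token list into consecutive windows of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def average_embeddings(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def embed_document(tokens, embed_chunk):
    """Embed each 4096-token chunk and average, so the whole document counts."""
    return average_embeddings([embed_chunk(c) for c in chunk(tokens)])
```

The averaged vector can then be fed to the classifier exactly like a single-chunk embedding.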
## Predicted Entities
`rights-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_rights_agreement_en_1.0.0_3.0_1669294308500.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_rights_agreement_en_1.0.0_3.0_1669294308500.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+------------------+
|result            |
+------------------+
|[rights-agreement]|
|[other]           |
|[other]           |
|[rights-agreement]|
+------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_rights_agreement|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.4 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
other 0.99 0.98 0.98 90
rights-agreement 0.94 0.97 0.95 30
accuracy - - 0.97 120
macro-avg 0.96 0.97 0.97 120
weighted-avg 0.98 0.97 0.98 120
```
---
layout: model
title: Translate Korean to English Pipeline
author: John Snow Labs
name: translate_ko_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, ko, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `ko`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ko_en_xx_2.7.0_2.4_1609688668059.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ko_en_xx_2.7.0_2.4_1609688668059.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_ko_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_ko_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.ko.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_ko_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Vietnamese Deberta Embeddings model (from hieule)
author: John Snow Labs
name: deberta_embeddings_spm_vie
date: 2023-03-12
tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, vie, tensorflow]
task: Embeddings
language: vie
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DeBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spm-vie-deberta` is a Vietnamese model originally trained by `hieule`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_spm_vie_vie_4.3.1_3.0_1678627522214.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_spm_vie_vie_4.3.1_3.0_1678627522214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_spm_vie","vie") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_spm_vie","vie")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|deberta_embeddings_spm_vie|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|vie|
|Size:|290.3 MB|
|Case sensitive:|false|
## References
https://huggingface.co/hieule/spm-vie-deberta
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_hyAM_batch4 TFWav2Vec2ForCTC from lilitket
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_hyAM_batch4` is an English model originally trained by lilitket.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_en_4.2.0_3.0_1664119468999.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_en_4.2.0_3.0_1664119468999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Legal Second Supplemental Indenture Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_second_supplemental_indenture_bert
date: 2023-02-02
tags: [en, legal, classification, second, supplemental, indenture, licensed, bert, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_second_supplemental_indenture_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `second-supplemental-indenture` or not (Binary Classification).
Compared with the Longformer-based model, this model is lighter in terms of inference time.
## Predicted Entities
`second-supplemental-indenture`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_second_supplemental_indenture_bert_en_1.0.0_3.0_1675359737104.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_second_supplemental_indenture_bert_en_1.0.0_3.0_1675359737104.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------------------------------+
|result                         |
+-------------------------------+
|[second-supplemental-indenture]|
|[other]                        |
|[other]                        |
|[second-supplemental-indenture]|
+-------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_second_supplemental_indenture_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
other 0.95 0.97 0.96 73
second-supplemental-indenture 0.95 0.90 0.92 39
accuracy - - 0.95 112
macro-avg 0.95 0.94 0.94 112
weighted-avg 0.95 0.95 0.95 112
```
---
layout: model
title: Legal Material contracts Clause Binary Classifier
author: John Snow Labs
name: legclf_material_contracts_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `material-contracts` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences instead of the whole text, so it is better to skip it unless you want to do Binary Classification at sentence level.
If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
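As a hypothetical illustration of the first technique, paragraph splitting by multiline can be done in a few lines of plain Python. This is a sketch only; helper names such as `split_paragraphs` are ours, not part of Spark NLP.

```python
def split_paragraphs(text):
    """Split a document on blank lines (paragraph splitting by multiline)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def within_token_limit(paragraph, max_tokens=512):
    """Rough whitespace-token count as a proxy for the 512-token embedding limit."""
    return len(paragraph.split()) <= max_tokens
```

Each resulting paragraph can then be classified independently by the clause classifier.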
Take into account that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
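That combination can be sketched in plain Python. This is a hypothetical illustration; the stub classifiers below are keyword matchers, not the actual pretrained models.

```python
def detect_clauses(text, classifiers):
    """Run several binary clause classifiers on the same text and collect a
    clause-name -> True/False map (True when the positive label is predicted)."""
    return {name: clf(text) != "other" for name, clf in classifiers.items()}

# Hypothetical stand-ins for pretrained classifiers such as
# legclf_material_contracts_clause: each returns its positive label or "other".
stub_classifiers = {
    "material-contracts": lambda t: "material-contracts" if "material" in t else "other",
    "rights-agreement": lambda t: "rights-agreement" if "rights" in t else "other",
}
```

With the real models, each classifier would be its own Spark NLP pipeline; the dictionary of True/False flags per clause type is the shape of the combined output.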
## Predicted Entities
`other`, `material-contracts`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_material_contracts_clause_en_1.0.0_3.2_1660122646574.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_material_contracts_clause_en_1.0.0_3.2_1660122646574.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+--------------------+
|result              |
+--------------------+
|[material-contracts]|
|[other]             |
|[other]             |
|[material-contracts]|
+--------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_material_contracts_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
material-contracts 0.85 0.79 0.82 29
other 0.94 0.96 0.95 93
accuracy - - 0.92 122
macro-avg 0.89 0.88 0.88 122
weighted-avg 0.92 0.92 0.92 122
```
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_2
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-64-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_2_en_4.0.0_3.0_1657185416907.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_2_en_4.0.0_3.0_1657185416907.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-64-finetuned-squad-seed-2
---
layout: model
title: Word2Vec Embeddings in Sanskrit (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, sa, open_source]
task: Embeddings
language: sa
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sa_3.4.1_3.0_1647455309990.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sa_3.4.1_3.0_1647455309990.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sa") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sa")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("sa.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|sa|
|Size:|288.9 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Bank Complaints Classification
author: John Snow Labs
name: finclf_bank_complaints
date: 2022-08-09
tags: [en, finance, bank, classification, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model classifies bank-related texts into 7 different categories and can be used to automatically process incoming emails to customer support channels and forward them to the proper recipients.
## Predicted Entities
`Accounts`, `Credit Cards`, `Credit Reporting`, `Debt Collection`, `Loans`, `Money Transfer and Currency`, `Mortgage`
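The email-forwarding use case mentioned above could be wired up with a simple label-to-recipient map once the classifier has produced its prediction. This is a hypothetical sketch; the addresses and the `route_complaint` helper are ours, not part of the model.

```python
# Hypothetical routing table from predicted class to a support inbox.
ROUTES = {
    "Accounts": "accounts@bank.example.com",
    "Credit Cards": "cards@bank.example.com",
    "Credit Reporting": "reporting@bank.example.com",
    "Debt Collection": "collections@bank.example.com",
    "Loans": "loans@bank.example.com",
    "Money Transfer and Currency": "transfers@bank.example.com",
    "Mortgage": "mortgage@bank.example.com",
}

def route_complaint(predicted_label, routes=ROUTES, default="support@bank.example.com"):
    """Forward a classified complaint to the matching team, or a fallback inbox."""
    return routes.get(predicted_label, default)
```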
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_bank_complaints_en_1.0.0_3.2_1660035048303.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_bank_complaints_en_1.0.0_3.2_1660035048303.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = nlp.UniversalSentenceEncoder.pretrained() \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
classifier_dl = nlp.ClassifierDLModel.pretrained("finclf_bank_complaints", "en", "finance/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("label")
clf_pipeline = nlp.Pipeline(
stages = [
document_assembler,
embeddings,
classifier_dl
])
light_pipeline = LightPipeline(clf_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
result = light_pipeline.annotate("""Over the course of 30 days I have filed a dispute in regards to inaccurate and false information on my credit report. Ive attached a copy of my dispute mailed in certified to Equifax and they are still reporting these incorrect items. According to the fair credit ACT, section 609 ( a ) ( 1 ) ( A ) they are required by Federal Law to only report Accurate information and the have not done so. They have not provided me with any proof i.e. and original consumer contract with my signature on it proving that this is my account.Further more, I would like to make a formal complaint that Ive tried calling Equifax Over 10 times this week and every single time Ive called Ive asked for a representative in the fraud dispute department wants transfer it over there when you speak to the representative and let them know that you are looking to dispute inquiries and accounts due to fraud they immediately transfer you to their survey line essentially ending the call. I believe Equifax is training their representatives to not help consumers over the phone and performing unethical practices. Once I finally got a hold of a representative she told me that she could not help because I did not send in my Social Security card which violates my consumer rights. So Im Making a formal CFPB complaint that you will correct Equifaxs actions. Below Ive written what is also included in the files uploaded, my disputes for inaccuracies on my credit report.""")
result['label']
```
## Results
```bash
['Credit Reporting']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finclf_bank_complaints|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|
## References
https://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data
## Benchmarking
```bash
label precision recall f1-score support
Accounts 0.77 0.73 0.75 490
Credit_Cards 0.75 0.68 0.72 461
Credit_Reporting 0.73 0.81 0.76 488
Debt_Collection 0.72 0.72 0.72 459
Loans 0.78 0.78 0.78 472
Money_Transfer_and_Currency 0.82 0.84 0.83 482
Mortgage 0.87 0.87 0.87 488
accuracy - - 0.78 3340
macro-avg 0.78 0.78 0.78 3340
weighted-avg 0.78 0.78 0.78 3340
```
---
layout: model
title: Spanish BertForQuestionAnswering model (from MMG)
author: John Snow Labs
name: bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad2_es_MMG
date: 2022-06-03
tags: [es, open_source, question_answering, bert]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-cased-finetuned-sqac-finetuned-squad2-es` is a Spanish model originally trained by `MMG`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad2_es_MMG_es_4.0.0_3.0_1654249736195.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad2_es_MMG_es_4.0.0_3.0_1654249736195.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad2_es_MMG","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad2_es_MMG","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.squadv2_sqac.bert.base_cased.by_MMG").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad2_es_MMG|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|410.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/MMG/bert-base-spanish-wwm-cased-finetuned-sqac-finetuned-squad2-es
---
layout: model
title: SDOH Substance Usage For Binary Classification
author: John Snow Labs
name: genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli
date: 2023-01-14
tags: [en, licensed, generic_classifier, sdoh, substance, clinical]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
recommended: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Generic Classifier model detects substance use in clinical notes and was trained using the GenericClassifierApproach annotator. `Present:` if the patient is a current substance user. `None:` if the patient quit in the past, never used substances, or there is no related text.
## Predicted Entities
`Present`, `None`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli_en_4.2.4_3.0_1673697973649.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli_en_4.2.4_3.0_1673697973649.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
features_asm = FeaturesAssembler()\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("features")
generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli", 'en', 'clinical/models')\
.setInputCols(["features"])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
sentence_embeddings,
features_asm,
generic_classifier
])
text_list = ["Lives in apartment with 16-year-old daughter. Denies EtOH use currently although reports occasional use in past. Utox on admission positive for opiate (on as rx) as well as cocaine. 4-6 cigarettes a day on and off for 10 years. Denies h/o illicit drug use besides marijuana although admitted to cocaine use after being found to have urine positive for cocaine.",
"The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use. He also reports 'heavy alcohol use', quit 15 months ago."]
df = spark.createDataFrame(text_list, StringType()).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("text", "class.result").show(truncate=100)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val features_asm = new FeaturesAssembler()
.setInputCols("sentence_embeddings")
.setOutputCol("features")
val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli", "en", "clinical/models")
.setInputCols("features")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_embeddings,
features_asm,
generic_classifier))
val data = Seq("The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use. He also reports 'heavy alcohol use', quit 15 months ago.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.generic.sdoh_substance_binary_sbiobert_cased").predict("""Lives in apartment with 16-year-old daughter. Denies EtOH use currently although reports occasional use in past. Utox on admission positive for opiate (on as rx) as well as cocaine. 4-6 cigarettes a day on and off for 10 years. Denies h/o illicit drug use besides marijuana although admitted to cocaine use after being found to have urine positive for cocaine.""")
```
## Results
```bash
+----------------------------------------------------------------------------------------------------+---------+
| text| result|
+----------------------------------------------------------------------------------------------------+---------+
|Lives in apartment with 16-year-old daughter. Denies EtOH use currently although reports occasion...|[Present]|
|The patient quit smoking approximately two years ago with an approximately a 40 pack year history...| [None]|
+----------------------------------------------------------------------------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli|
|Compatibility:|Healthcare NLP 4.2.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[features]|
|Output Labels:|[prediction]|
|Language:|en|
|Size:|3.4 MB|
## Benchmarking
```bash
label precision recall f1-score support
None 0.91 0.83 0.87 898
Present 0.76 0.87 0.81 540
accuracy - - 0.85 1438
macro-avg 0.83 0.85 0.84 1438
weighted-avg 0.85 0.85 0.85 1438
```
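As a sanity check, the macro and weighted averages in the table above follow directly from the per-class F1 scores and supports (plain Python, values copied from the benchmarking table):

```python
# Per-class F1 scores and supports from the benchmarking table above.
scores = {"None": (0.87, 898), "Present": (0.81, 540)}

f1s = [f1 for f1, _ in scores.values()]
total = sum(n for _, n in scores.values())  # 1438

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1s) / len(f1s)
# Weighted average: mean weighted by each class's support.
weighted_f1 = sum(f1 * n for f1, n in scores.values()) / total

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.84 0.85
```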
---
layout: model
title: Legal Multilabel Classifier on Covid-19 Exceptions (Italian)
author: John Snow Labs
name: legmulticlf_covid19_exceptions_italian
date: 2023-04-20
tags: [it, licensed, legal, multilabel, classification, tensorflow]
task: Text Classification
language: it
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MultiClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a multi-label text classification model that assigns up to five of the following classes to facilitate analysis, discovery, and comparison of Italian legal texts related to COVID-19 exception measures:
- Closures/lockdown
- Government_oversight
- Restrictions_of_daily_liberties
- Restrictions_of_fundamental_rights_and_civil_liberties
- State_of_Emergency
## Predicted Entities
`Closures/lockdown`, `Government_oversight`, `Restrictions_of_daily_liberties`, `Restrictions_of_fundamental_rights_and_civil_liberties`, `State_of_Emergency`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legmulticlf_covid19_exceptions_italian_it_1.0.0_3.0_1681985472330.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legmulticlf_covid19_exceptions_italian_it_1.0.0_3.0_1681985472330.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_bert_base_italian_xxl_cased", "it") \
.setInputCols(["document", "token"])\
.setOutputCol("embeddings")
embeddingsSentence = nlp.SentenceEmbeddings() \
.setInputCols(["document", "embeddings"])\
.setOutputCol("sentence_embeddings")\
.setPoolingStrategy("AVERAGE")
multilabelClfModel = nlp.MultiClassifierDLModel.pretrained('legmulticlf_covid19_exceptions_italian', 'it', "legal/models") \
.setInputCols(["sentence_embeddings"])\
.setOutputCol("class")
clf_pipeline = nlp.Pipeline(
stages=[document_assembler,
tokenizer,
embeddings,
embeddingsSentence,
multilabelClfModel])
df = spark.createDataFrame([["Al di fuori di tale ultima ipotesi, secondo le raccomandazioni impartite dal Ministero della salute, occorre provvedere ad assicurare la corretta applicazione di misure preventive quali lavare frequentemente le mani con acqua e detergenti comuni."]]).toDF("text")
model = clf_pipeline.fit(df)
result = model.transform(df)
result.select("text", "class.result").show(truncate=False)
```
## Results
```bash
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+
|text |result |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+
|Al di fuori di tale ultima ipotesi, secondo le raccomandazioni impartite dal Ministero della salute, occorre provvedere ad assicurare la corretta applicazione di misure preventive quali lavare frequentemente le mani con acqua e detergenti comuni.|[Restrictions_of_daily_liberties]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legmulticlf_covid19_exceptions_italian|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|it|
|Size:|13.9 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/joelito/covid19_emergency_event)
## Benchmarking
```bash
label precision recall f1-score support
Closures/lockdown 0.88 0.94 0.91 47
Government_oversight 1.00 0.50 0.67 4
Restrictions_of_daily_liberties 0.88 0.79 0.83 28
Restrictions_of_fundamental_rights_and_civil_liberties 0.62 0.62 0.62 16
State_of_Emergency 0.67 1.00 0.80 6
micro-avg 0.82 0.83 0.83 101
macro-avg 0.81 0.77 0.77 101
weighted-avg 0.83 0.83 0.83 101
samples-avg 0.81 0.84 0.81 101
```
---
layout: model
title: English image_classifier_vit_pond_image_classification_12 ViTForImageClassification from SummerChiam
author: John Snow Labs
name: image_classifier_vit_pond_image_classification_12
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pond_image_classification_12` is an English model originally trained by SummerChiam.
## Predicted Entities
`NormalCement0`, `Boiling0`, `NormalNight0`, `Algae0`, `BoilingNight0`, `NormalRain0`, `Normal0`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_12_en_4.1.0_3.0_1660171317776.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_12_en_4.1.0_3.0_1660171317776.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_pond_image_classification_12", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_pond_image_classification_12", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_pond_image_classification_12|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: English asr_hausa_4_wa2vec_data_aug_xls_r_300m TFWav2Vec2ForCTC from Tiamz
author: John Snow Labs
name: pipeline_asr_hausa_4_wa2vec_data_aug_xls_r_300m
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_hausa_4_wa2vec_data_aug_xls_r_300m` is an English model originally trained by Tiamz.
NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_hausa_4_wa2vec_data_aug_xls_r_300m_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_hausa_4_wa2vec_data_aug_xls_r_300m_en_4.2.0_3.0_1664108237641.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_hausa_4_wa2vec_data_aug_xls_r_300m_en_4.2.0_3.0_1664108237641.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_hausa_4_wa2vec_data_aug_xls_r_300m', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_hausa_4_wa2vec_data_aug_xls_r_300m", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_hausa_4_wa2vec_data_aug_xls_r_300m|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Legal Warranty NER (md)
author: John Snow Labs
name: legner_warranty_md
date: 2022-12-01
tags: [warranty, en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
IMPORTANT: Don't run this model on the whole legal agreement. Instead:
- Split by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration;
- Use the `legclf_warranty_clause` Text Classifier to select only those paragraphs.
This is a Legal Named Entity Recognition model that identifies the Subject (who), Action (what), Object (the warranty) and Indirect Object (to whom) in Warranty clauses.
This is the `md` (medium) version of the model, trained with more data and more resistant to false positives outside the target section, which may help when running it at the whole-document level (although this is not recommended).
## Predicted Entities
`WARRANTY`, `WARRANTY_ACTION`, `WARRANTY_SUBJECT`, `WARRANTY_INDIRECT_OBJECT`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_warranty_md_en_1.0.0_3.0_1669893390077.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_warranty_md_en_1.0.0_3.0_1669893390077.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = legal.NerModel.pretrained('legner_warranty_md', 'en', 'legal/models')\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[documentAssembler,sentenceDetector,tokenizer,embeddings,ner_model,ner_converter])
data = spark.createDataFrame([["""8 . Representations and Warranties SONY hereby makes the following representations and warranties to PURCHASER , each of which shall be true and correct as of the date hereof and as of the Closing Date , and shall be unaffected by any investigation heretofore or hereafter made : 8.1 Power and Authority SONY has the right and power to enter into this IP Agreement and to transfer the Transferred Patents and to grant the license set forth in Section 3.1 ."""]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
```
## Results
```bash
+--------------------------------------------------------------------------+------------------------+
|chunk |entity |
+--------------------------------------------------------------------------+------------------------+
|SONY |WARRANTY_SUBJECT |
|makes the following representations and warranties |WARRANTY_ACTION |
|PURCHASER |WARRANTY_INDIRECT_OBJECT|
|shall be true and correct as of the date hereof and as of the Closing Date|WARRANTY |
|shall be unaffected by any investigation |WARRANTY |
|SONY |WARRANTY_SUBJECT |
|has the right and power to enter into this IP Agreement |WARRANTY |
+--------------------------------------------------------------------------+------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_warranty_md|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|16.1 MB|
## References
In-house annotated examples from CUAD legal dataset
## Benchmarking
```bash
label tp fp fn prec rec f1
I-WARRANTY_SUBJECT 23 9 19 0.71875 0.54761904 0.62162155
B-WARRANTY 111 36 34 0.75510204 0.76551723 0.760274
B-WARRANTY_SUBJECT 55 31 33 0.6395349 0.625 0.6321839
I-WARRANTY_INDIRECT_OBJECT 18 6 3 0.75 0.85714287 0.79999995
I-WARRANTY_ACTION 77 8 14 0.90588236 0.84615386 0.875
B-WARRANTY_ACTION 36 4 4 0.9 0.9 0.9
I-WARRANTY 1686 487 313 0.7758859 0.8434217 0.8082455
B-WARRANTY_INDIRECT_OBJECT 34 12 6 0.73913044 0.85 0.79069775
Macro-average 2040 593 426 0.7730357 0.7793569 0.7761834
Micro-average 2040 593 426 0.77478164 0.8272506 0.80015695
```
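The micro-averaged row can likewise be reproduced from the aggregate tp/fp/fn counts (plain Python, values taken from the table above):

```python
# Aggregate counts from the Micro-average row of the benchmarking table.
tp, fp, fn = 2040, 593, 426

precision = tp / (tp + fp)  # true positives over all predicted positives
recall = tp / (tp + fn)     # true positives over all actual positives
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 5), round(recall, 5), round(f1, 5))  # 0.77478 0.82725 0.80016
```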
---
layout: model
title: English BertForQuestionAnswering Base Cased model (from niklaspm)
author: John Snow Labs
name: bert_qa_linkbert_base_finetuned_squad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `linkbert-base-finetuned-squad` is an English model originally trained by `niklaspm`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_linkbert_base_finetuned_squad_en_4.0.0_3.0_1657189758932.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_linkbert_base_finetuned_squad_en_4.0.0_3.0_1657189758932.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_linkbert_base_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_linkbert_base_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_linkbert_base_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/niklaspm/linkbert-base-finetuned-squad
- https://arxiv.org/abs/2203.15827
---
layout: model
title: Translate English to Tok Pisin Pipeline
author: John Snow Labs
name: translate_en_tpi
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, tpi, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `tpi`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_tpi_xx_2.7.0_2.4_1609686751040.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_tpi_xx_2.7.0_2.4_1609686751040.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_tpi", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_tpi", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.tpi').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_tpi|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Longformer (Base, 4096)
author: John Snow Labs
name: legal_longformer_base
date: 2022-10-20
tags: [en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.1
spark_version: [3.2, 3.0]
supported: true
annotator: LongformerEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
[Longformer](https://arxiv.org/abs/2004.05150) is a transformer model for long documents.
`legal_longformer_base` is a BERT-like model that starts from the RoBERTa checkpoint and is pretrained for MLM on long documents. It supports sequences of up to 4,096 tokens and is specifically trained on *legal documents*.
Longformer uses a combination of a sliding window (local) attention and global attention. Global attention is user-configured based on the task to allow the model to learn task-specific representations.
If you use `Longformer` in your research, please cite [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150).
```
@article{Beltagy2020Longformer,
title={Longformer: The Long-Document Transformer},
author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
journal={arXiv:2004.05150},
year={2020},
}
```
`Longformer` is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org).
AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/legal_longformer_base_en_4.2.1_3.2_1666282710556.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/legal_longformer_base_en_4.2.1_3.2_1666282710556.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
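This card's usage section is empty; the following is a minimal sketch in the same pattern as the other embedding models on this site (pipeline stages and the example sentence are illustrative assumptions, not part of the original card):

```python
# Sketch: use the Legal Longformer embeddings in a standard Spark NLP pipeline.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
embeddings = LongformerEmbeddings.pretrained("legal_longformer_base", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
# Hypothetical example input; any legal text up to 4,096 tokens works.
data = spark.createDataFrame([["This Agreement is governed by the laws of the State of Delaware."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```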
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legal_longformer_base|
|Compatibility:|Spark NLP 4.2.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|531.1 MB|
|Case sensitive:|true|
|Max sentence length:|4096|
## References
https://huggingface.co/saibo/legal-longformer-base-4096
---
layout: model
title: Pipeline to Detect Genetic Cancer Entities
author: John Snow Labs
name: ner_cancer_genetics_pipeline
date: 2023-03-15
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_cancer_genetics](https://nlp.johnsnowlabs.com/2021/03/31/ner_cancer_genetics_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cancer_genetics_pipeline_en_4.3.0_3.2_1678864026558.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cancer_genetics_pipeline_en_4.3.0_3.2_1678864026558.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_cancer_genetics_pipeline", "en", "clinical/models")
text = '''The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_cancer_genetics_pipeline", "en", "clinical/models")
val text = "The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.cancer_genetics.pipeline").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes.""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:------------------------------------------------------------------------|--------:|------:|:------------|-------------:|
| 0 | human KCNJ9 | 4 | 14 | protein | 0.674 |
| 1 | Kir 3.3 | 17 | 23 | protein | 0.95355 |
| 2 | GIRK3 | 26 | 30 | protein | 0.5127 |
| 3 | G-protein-activated inwardly rectifying potassium (GIRK) channel family | 52 | 122 | protein | 0.691744 |
| 4 | KCNJ9 locus | 173 | 183 | DNA | 0.97875 |
| 5 | chromosome 1q21-23 | 188 | 205 | DNA | 0.95305 |
| 6 | coding exons | 357 | 368 | DNA | 0.63345 |
| 7 | identified14 single nucleotide polymorphisms | 451 | 494 | DNA | 0.6994 |
| 8 | SNPs), | 497 | 502 | DNA | 0.79075 |
| 9 | KCNJ9 gene | 801 | 810 | DNA | 0.95605 |
| 10 | KCNJ9 protein | 868 | 880 | protein | 0.844 |
| 11 | locus | 931 | 935 | DNA | 0.9685 |
```
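The table above can be rebuilt from the `fullAnnotate` output outside Spark. A minimal pure-Python sketch; the `result` structure below is mocked for illustration (the real entries are Spark NLP `Annotation` objects whose fields carry the same information):

```python
# Mocked fullAnnotate-style output: each chunk records its text, character
# span, entity label, and confidence (two rows from the table above).
result = [{
    "ner_chunk": [
        {"result": "human KCNJ9", "begin": 4, "end": 14,
         "metadata": {"entity": "protein", "confidence": "0.674"}},
        {"result": "KCNJ9 locus", "begin": 173, "end": 183,
         "metadata": {"entity": "DNA", "confidence": "0.97875"}},
    ]
}]

# Flatten the chunks into table rows.
rows = [
    (c["result"], c["begin"], c["end"],
     c["metadata"]["entity"], float(c["metadata"]["confidence"]))
    for c in result[0]["ner_chunk"]
]
for ner_chunk, begin, end, label, conf in rows:
    print(f"{ner_chunk:<15} {begin:>5} {end:>5} {label:<10} {conf:.5f}")
```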
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_cancer_genetics_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_12_h_256
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-12_H-256` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_256_zh_4.2.4_3.0_1670021539975.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_256_zh_4.2.4_3.0_1670021539975.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_256","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_256","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_12_h_256|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|57.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-12_H-256
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://cloud.tencent.com/
---
layout: model
title: Mapping RXNORM Codes with Their Corresponding UMLS Codes
author: John Snow Labs
name: rxnorm_umls_mapper
date: 2022-06-26
tags: [rxnorm, umls, chunk_mapper, clinical, licensed, en]
task: Chunk Mapping
language: en
nav_key: models
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained model maps RXNORM codes to corresponding UMLS codes.
## Predicted Entities
`umls_code`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_umls_mapper_en_3.5.3_3.0_1656276292081.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_umls_mapper_en_3.5.3_3.0_1656276292081.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli", "en","clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("sbert_embeddings")
rxnorm_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")\
.setInputCols(["ner_chunk", "sbert_embeddings"])\
.setOutputCol("rxnorm_code")\
.setDistanceFunction("EUCLIDEAN")
chunkerMapper = ChunkMapperModel\
.pretrained("rxnorm_umls_mapper", "en", "clinical/models")\
.setInputCols(["rxnorm_code"])\
.setOutputCol("umls_mappings")\
.setRels(["umls_code"])
pipeline = Pipeline(stages = [
documentAssembler,
sbert_embedder,
rxnorm_resolver,
chunkerMapper
])
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_pipeline= LightPipeline(model)
result = light_pipeline.fullAnnotate("amlodipine 5 MG")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
.setInputCols("ner_chunk")
.setOutputCol("sbert_embeddings")
val rxnorm_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")
.setInputCols(Array("ner_chunk", "sbert_embeddings"))
.setOutputCol("rxnorm_code")
.setDistanceFunction("EUCLIDEAN")
val chunkerMapper = ChunkMapperModel
.pretrained("rxnorm_umls_mapper", "en", "clinical/models")
.setInputCols("rxnorm_code")
.setOutputCol("umls_mappings")
.setRels(Array("umls_code"))
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sbert_embedder,
rxnorm_resolver,
chunkerMapper
))
val data = Seq("amlodipine 5 MG").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.rxnorm_to_umls").predict("""amlodipine 5 MG""")
```
## Results
```bash
| | ner_chunk | rxnorm_code | umls_mappings |
|---:|:----------------|--------------:|:----------------|
| 0 | amlodipine 5 MG | 329528 | C1124796 |
```
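Conceptually, the `ChunkMapperModel` stage is a lookup from each resolved RxNorm code to its UMLS concept ID. A toy sketch of that final step (the single pair below is taken from the result above; the real mapper bundles the full RxNorm-to-UMLS table):

```python
# Toy RxNorm -> UMLS lookup table; the pretrained mapper ships the full set.
rxnorm_to_umls = {
    "329528": "C1124796",  # amlodipine 5 MG, as in the example result above
}

def map_rxnorm(code):
    """Return the UMLS concept ID for an RxNorm code, or None if unmapped."""
    return rxnorm_to_umls.get(code)

print(map_rxnorm("329528"))  # C1124796
```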
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|rxnorm_umls_mapper|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[rxnorm_code]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|1.9 MB|
## References
This pretrained model maps RXNORM codes to their corresponding Unified Medical Language System (UMLS) codes.
---
layout: model
title: Pipeline to Detect Clinical Entities (ner_eu_clinical_case - es)
author: John Snow Labs
name: ner_eu_clinical_case_pipeline
date: 2023-03-08
tags: [es, clinical, licensed, ner]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_eu_clinical_case](https://nlp.johnsnowlabs.com/2023/02/01/ner_eu_clinical_case_es.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_pipeline_es_4.3.0_3.2_1678261388612.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_pipeline_es_4.3.0_3.2_1678261388612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_eu_clinical_case_pipeline", "es", "clinical/models")
text = '''
Un niño de 3 años con trastorno autista en el hospital de la sala pediátrica A del hospital universitario. No tiene antecedentes familiares de enfermedad o trastorno del espectro autista. El niño fue diagnosticado con un trastorno de comunicación severo, con dificultades de interacción social y retraso en el procesamiento sensorial. Los análisis de sangre fueron normales (hormona estimulante de la tiroides (TSH), hemoglobina, volumen corpuscular medio (MCV) y ferritina). La endoscopia alta también mostró un tumor submucoso que causaba una obstrucción subtotal de la salida gástrica. Ante la sospecha de tumor del estroma gastrointestinal, se realizó gastrectomía distal. El examen histopatológico reveló proliferación de células fusiformes en la capa submucosa.
'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_eu_clinical_case_pipeline", "es", "clinical/models")
val text = """
Un niño de 3 años con trastorno autista en el hospital de la sala pediátrica A del hospital universitario. No tiene antecedentes familiares de enfermedad o trastorno del espectro autista. El niño fue diagnosticado con un trastorno de comunicación severo, con dificultades de interacción social y retraso en el procesamiento sensorial. Los análisis de sangre fueron normales (hormona estimulante de la tiroides (TSH), hemoglobina, volumen corpuscular medio (MCV) y ferritina). La endoscopia alta también mostró un tumor submucoso que causaba una obstrucción subtotal de la salida gástrica. Ante la sospecha de tumor del estroma gastrointestinal, se realizó gastrectomía distal. El examen histopatológico reveló proliferación de células fusiformes en la capa submucosa.
"""
val result = pipeline.fullAnnotate(text)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_codeswitch_nepeng_lid_lince","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_codeswitch_nepeng_lid_lince","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_codeswitch_nepeng_lid_lince|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/sagorsarker/codeswitch-nepeng-lid-lince
- https://ritual.uh.edu/lince/home
- https://github.com/sagorbrur/codeswitch
---
layout: model
title: Translate English to Indonesian Pipeline
author: John Snow Labs
name: translate_en_id
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, id, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences, so the use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `id`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_id_xx_2.7.0_2.4_1609690363498.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_id_xx_2.7.0_2.4_1609690363498.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_id", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_id", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.id').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_id|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: German T5ForConditionalGeneration Cased model (from dehio)
author: John Snow Labs
name: t5_german_qg_e2e_quad
date: 2023-01-30
tags: [de, open_source, t5, tensorflow]
task: Text Generation
language: de
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `german-qg-t5-e2e-quad` is a German model originally trained by `dehio`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_german_qg_e2e_quad_de_4.3.0_3.0_1675102645662.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_german_qg_e2e_quad_de_4.3.0_3.0_1675102645662.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_german_qg_e2e_quad","de") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_german_qg_e2e_quad","de")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_german_qg_e2e_quad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|de|
|Size:|924.3 MB|
## References
- https://huggingface.co/dehio/german-qg-t5-e2e-quad
---
layout: model
title: Pipeline to Detect PHI in text (enriched-biobert)
author: John Snow Labs
name: ner_deid_enriched_biobert_pipeline
date: 2023-03-20
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_deid_enriched_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_deid_enriched_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_biobert_pipeline_en_4.3.0_3.2_1679316429600.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_biobert_pipeline_en_4.3.0_3.2_1679316429600.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_deid_enriched_biobert_pipeline", "en", "clinical/models")
text = '''A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_deid_enriched_biobert_pipeline", "en", "clinical/models")
val text = "A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.deid.ner_enriched_biobert.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:------------------------------|--------:|------:|:-------------|-------------:|
| 0 | 2093-01-13 | 17 | 26 | DATE | 0.9267 |
| 1 | David Hale | 29 | 38 | DOCTOR | 0.7949 |
| 2 | Hendrickson, Ora | 53 | 68 | PATIENT | 0.637733 |
| 3 | 7194334 | 76 | 82 | PHONE | 0.4939 |
| 4 | Cocke County Baptist Hospital | 114 | 142 | HOSPITAL | 0.6199 |
| 5 | 0295 Keats Street | 145 | 161 | STREET | 0.592433 |
| 6 | 302) 786-5227 | 174 | 186 | PHONE | 0.846833 |
| 7 | Brothers Coal-Mine | 253 | 270 | ORGANIZATION | 0.45085 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_enriched_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.2 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Chuvash RobertaForQuestionAnswering (from sunitha)
author: John Snow Labs
name: roberta_qa_CV_Custom_DS
date: 2022-06-20
tags: [open_source, question_answering, roberta]
task: Question Answering
language: cv
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `CV_Custom_DS` is a Chuvash model originally trained by `sunitha`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_CV_Custom_DS_cv_4.0.0_3.0_1655726596821.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_CV_Custom_DS_cv_4.0.0_3.0_1655726596821.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_CV_Custom_DS","cv") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_CV_Custom_DS","cv")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("cv.answer_question.roberta.cv_custom_ds.by_sunitha").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_CV_Custom_DS|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|cv|
|Size:|464.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/sunitha/CV_Custom_DS
---
layout: model
title: English Deberta Embeddings model (from smeoni)
author: John Snow Labs
name: deberta_embeddings_nbme_V3_large
date: 2023-03-13
tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, en, tensorflow]
task: Embeddings
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DeBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `nbme-deberta-V3-large` is an English model originally trained by `smeoni`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_nbme_V3_large_en_4.3.1_3.0_1678713648667.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_nbme_V3_large_en_4.3.1_3.0_1678713648667.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_nbme_V3_large","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_nbme_V3_large","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|deberta_embeddings_nbme_V3_large|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|1.6 GB|
|Case sensitive:|false|
## References
https://huggingface.co/smeoni/nbme-deberta-V3-large
---
layout: model
title: English BertForMaskedLM Large Cased model (from VMware)
author: John Snow Labs
name: bert_embeddings_v_2021_large
date: 2022-12-02
tags: [en, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `vbert-2021-large` is an English model originally trained by `VMware`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_v_2021_large_en_4.2.4_3.0_1670023012204.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_v_2021_large_en_4.2.4_3.0_1670023012204.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_v_2021_large","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_v_2021_large","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_v_2021_large|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
## References
- https://huggingface.co/VMware/vbert-2021-large
- https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99
---
layout: model
title: English BertForQuestionAnswering Uncased model (from aodiniz)
author: John Snow Labs
name: bert_qa_uncased_l_2_h_128_a_2_squad2
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-2_H-128_A-2_squad2` is an English model originally trained by `aodiniz`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_l_2_h_128_a_2_squad2_en_4.0.0_3.0_1657188893270.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_l_2_h_128_a_2_squad2_en_4.0.0_3.0_1657188893270.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_l_2_h_128_a_2_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifer = BertForQuestionAnswering.pretrained("bert_qa_uncased_l_2_h_128_a_2_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(false)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_uncased_l_2_h_128_a_2_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|16.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/aodiniz/bert_uncased_L-2_H-128_A-2_squad2
---
layout: model
title: Detect Entities (GloVe)
author: John Snow Labs
name: ner_dl
date: 2020-03-19
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 2.4.3
spark_version: 2.4
tags: [ner, en, open_source]
supported: true
annotator: NerDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
`ner_dl` is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. It was trained on the CoNLL 2003 text corpus. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. The `ner_dl` model was trained with GloVe 100d word embeddings, so be sure to use the same embeddings in the pipeline.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_dl_en_2.4.3_2.4_1584624950746.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_dl_en_2.4.3_2.4_1584624950746.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## Predicted Entities
Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`.
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = WordEmbeddingsModel.pretrained('glove_100d') \
.setInputCols(["document", 'token']) \
.setOutputCol("embeddings")
ner_model = NerDLModel.pretrained("ner_dl", "en") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text'))
result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"]))
```
```scala
...
val embeddings = WordEmbeddingsModel.pretrained("glove_100d")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("ner_dl", "en")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""]
ner_df = nlu.load('en.ner.dl.glove.6B_100d').predict(text, output_level = "chunk")
ner_df[["entities", "entities_confidence"]]
```
{:.h2_title}
## Results
```bash
+-------------------------------+---------+
|chunk |ner_label|
+-------------------------------+---------+
|William Henry Gates III |PER |
|American |MISC |
|Microsoft Corporation |ORG |
|Microsoft |ORG |
|Gates |PER |
|Born |LOC |
|Seattle |LOC |
|Washington |LOC |
|Gates |PER |
|Microsoft |ORG |
|Paul Allen |PER |
|Albuquerque |LOC |
|New Mexico |LOC |
|Gates |PER |
|Gates |PER |
|Gates |PER |
|Microsoft |ORG |
|Bill & Melinda Gates Foundation|ORG |
|Melinda Gates |PER |
|Ray Ozzie |PER |
+-------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_dl|
|Type:|ner|
|Compatibility:| Spark NLP 2.4.3+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model was trained on the [CoNLL 2003 Data Set](https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003).
---
layout: model
title: Korean Bert Embeddings
author: John Snow Labs
name: bert_embeddings_bert_base
date: 2022-04-11
tags: [bert, embeddings, ko, open_source]
task: Embeddings
language: ko
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base` is a Korean model originally trained by `klue`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_ko_3.4.2_3.0_1649675453798.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_ko_3.4.2_3.0_1649675453798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base","ko") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base","ko")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ko.embed.bert").predict("""나는 Spark NLP를 좋아합니다""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ko|
|Size:|415.3 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/klue/bert-base
- https://github.com/KLUE-benchmark/KLUE
- https://arxiv.org/abs/2105.09680
---
layout: model
title: Persian BertForQuestionAnswering model (from ForutanRad)
author: John Snow Labs
name: bert_qa_bert_fa_QA_v1
date: 2022-06-02
tags: [open_source, question_answering, bert]
task: Question Answering
language: fa
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-fa-QA-v1` is a Persian model originally trained by `ForutanRad`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_fa_QA_v1_fa_4.0.0_3.0_1654181654761.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_fa_QA_v1_fa_4.0.0_3.0_1654181654761.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_fa_QA_v1","fa") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_fa_QA_v1","fa")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("fa.answer_question.bert.by_ForutanRad").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_fa_QA_v1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|fa|
|Size:|607.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ForutanRad/bert-fa-QA-v1
- https://arxiv.org/abs/2005.12515
---
layout: model
title: Slovak RobertaForTokenClassification Cased model (from crabz)
author: John Snow Labs
name: roberta_token_classifier_slovakbert_ner
date: 2023-03-01
tags: [sk, open_source, roberta, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: sk
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `slovakbert-ner` is a Slovak model originally trained by `crabz`.
## Predicted Entities
`4`, `2`, `6`, `1`, `0`, `5`, `3`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_slovakbert_ner_sk_4.3.0_3.0_1677703644531.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_slovakbert_ner_sk_4.3.0_3.0_1677703644531.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = RobertaForTokenClassification.pretrained("roberta_token_classifier_slovakbert_ner","sk") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = RobertaForTokenClassification.pretrained("roberta_token_classifier_slovakbert_ner","sk")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_token_classifier_slovakbert_ner|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|sk|
|Size:|439.1 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/crabz/slovakbert-ner
- https://paperswithcode.com/sota?task=Token+Classification&dataset=wikiann
---
layout: model
title: Maltese Lemmatizer
author: John Snow Labs
name: lemma
date: 2021-04-02
tags: [mt, open_source, lemmatizer]
task: Lemmatization
language: mt
edition: Spark NLP 2.7.5
spark_version: 2.4
supported: true
annotator: LemmatizerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a dictionary-based lemmatizer that assigns all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/TEXT_PREPROCESSING/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_PREPROCESSING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_mt_2.7.5_2.4_1617376734828.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_mt_2.7.5_2.4_1617376734828.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer()\
.setInputCols(["document"]) \
.setOutputCol("token")
lemmatizer = LemmatizerModel.pretrained("lemma", "mt") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])
example = spark.createDataFrame([["Il- Membru tal- Kumitat Leo Brincat talab li bħala xhud ikun hemm rappreżentant tal- MEPA u kien hemm qbil filwaqt li d- Deputat Laburista Joe Mizzi ta lista ta' persuni oħrajn mill- Korporazzjoni Enemalta u minn WasteServ u ma kienx hemm oġġezzjoni ."]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val lemmatizer = LemmatizerModel.pretrained("lemma", "mt")
.setInputCols("token")
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer))
val data = Seq("Il- Membru tal- Kumitat Leo Brincat talab li bħala xhud ikun hemm rappreżentant tal- MEPA u kien hemm qbil filwaqt li d- Deputat Laburista Joe Mizzi ta lista ta' persuni oħrajn mill- Korporazzjoni Enemalta u minn WasteServ u ma kienx hemm oġġezzjoni .").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["Il- Membru tal- Kumitat Leo Brincat talab li bħala xhud ikun hemm rappreżentant tal- MEPA u kien hemm qbil filwaqt li d- Deputat Laburista Joe Mizzi ta lista ta' persuni oħrajn mill- Korporazzjoni Enemalta u minn WasteServ u ma kienx hemm oġġezzjoni ."]
lemma_df = nlu.load('mt.lemma').predict(text, output_level = "document")
lemma_df.lemma.values[0]
```
## Results
```bash
+-------+
| lemma|
+-------+
| Il|
| _|
| _|
| tal|
| _|
| _|
| Leo|
|Brincat|
| _|
| _|
| _|
| _|
| _|
| _|
| _|
| tal|
| _|
| MEPA|
| _|
| _|
+-------+
only showing top 20 rows
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma|
|Compatibility:|Spark NLP 2.7.5+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|mt|
## Data Source
The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) version 2.7.
## Benchmarking
```bash
Precision=0.078, Recall=0.073, F1-score=0.075
```
---
layout: model
title: Named Entity Recognition - BERT Small (OntoNotes)
author: John Snow Labs
name: onto_small_bert_L4_512
date: 2020-12-05
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [ner, open_source, en]
supported: true
annotator: NerDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Onto is a Named Entity Recognition (or NER) model trained on OntoNotes 5.0. It can extract up to 18 entities such as people, places, organizations, money, time, date, etc.
This model uses the pretrained `small_bert_L4_512` embeddings model from the `BertEmbeddings` annotator as an input.
## Predicted Entities
`CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_small_bert_L4_512_en_2.7.0_2.4_1607199400149.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_small_bert_L4_512_en_2.7.0_2.4_1607199400149.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
ner_onto = NerDLModel.pretrained("onto_small_bert_L4_512", "en") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text'))
result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"]))
```
```scala
...
val ner_onto = NerDLModel.pretrained("onto_small_bert_L4_512", "en")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter))
val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""]
ner_df = nlu.load('en.ner.onto.bert.small_l4_512').predict(text, output_level='chunk')
ner_df[["entities", "entities_class"]]
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_rust_image_classification_5", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_rust_image_classification_5", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_rust_image_classification_5|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Company to Ticker using Nasdaq
author: John Snow Labs
name: finel_nasdaq_data_ticker
date: 2022-10-22
tags: [en, finance, companies, nasdaq, ticker, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a Financial Entity Resolver model, trained to obtain the ticker of a company registered on NASDAQ from its company name. You can use this model after extracting a company name with any NER model to obtain its ticker.
After this, you can use `finmapper_nasdaq_data_ticker` to augment the result with more information about the company from the NASDAQ data source, including the official company name, sector, location, currency, etc.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/ER_EDGAR_CRUNCHBASE/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finel_nasdaq_data_ticker_en_1.0.0_3.0_1666473763228.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finel_nasdaq_data_ticker_en_1.0.0_3.0_1666473763228.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
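This card ships without a code snippet. The sketch below follows the pattern used by other Sentence Entity Resolver cards and is an assumption, not the verified pipeline for this model: in particular, the sentence-embeddings model (`sent_bert_base_cased`) and the `"finance/models"` bucket argument are illustrative placeholders; consult the Finance NLP documentation for the exact embeddings this resolver was trained with.

```python
# Hedged sketch: resolve a company-name chunk (here passed in directly as text)
# to its NASDAQ ticker. Requires a licensed Finance NLP installation.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed embeddings stage; the resolver expects sentence embeddings as input.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

resolver = SentenceEntityResolverModel.pretrained("finel_nasdaq_data_ticker", "en", "finance/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("normalized")

pipeline = Pipeline(stages=[document_assembler, embeddings, resolver])

data = spark.createDataFrame([["FIDUS INVESTMENT corp"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```

In a full pipeline, the text column would instead be a chunk extracted by an NER model (e.g. an `ORG` entity) converted to a document with `Chunk2Doc` before the embeddings stage.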
## Results
```bash
+----------------------+-------+
|text |result |
+----------------------+-------+
|FIDUS INVESTMENT corp |[FDUS] |
|ASPECT DEVELOPMENT Inc|[ASDV] |
|CFSB BANCORP |[CFSB] |
|DALEEN TECHNOLOGIES |[DALN1]|
|GLEASON Corporation |[GLE1] |
+----------------------+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finel_nasdaq_data_ticker|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings]|
|Output Labels:|[normalized]|
|Language:|en|
|Size:|69.8 MB|
|Case sensitive:|false|
## References
NASDAQ Database
---
layout: model
title: German BERT Base Uncased Model
author: John Snow Labs
name: bert_base_german_uncased
date: 2021-05-20
tags: [open_source, embeddings, german, de, bert]
task: Embeddings
language: de
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The source data for the model consists of a recent Wikipedia dump, the EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl, and News Crawl. This results in a dataset with a size of 16 GB and 2,350,234,427 tokens. The model was trained with an initial sequence length of 512 subwords for 1.5M steps.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_german_uncased_de_3.1.0_2.4_1621504361619.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_german_uncased_de_3.1.0_2.4_1621504361619.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
embeddings = BertEmbeddings.pretrained("bert_base_german_uncased", "de") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
```
```scala
val embeddings = BertEmbeddings.pretrained("bert_base_german_uncased", "de")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
```
{:.nlu-block}
```python
import nlu
nlu.load("de.embed.bert.uncased").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_base_german_uncased|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[embeddings]|
|Language:|de|
|Case sensitive:|true|
## Data Source
https://huggingface.co/dbmdz/bert-base-german-uncased
## Benchmarking
For results on downstream tasks like NER or PoS tagging, please refer to
[this repository](https://github.com/stefan-it/fine-tuned-berts-seq).
---
layout: model
title: Fast Neural Machine Translation Model from English to Bemba (Zambia)
author: John Snow Labs
name: opus_mt_en_bem
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, bem, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `bem`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_bem_xx_2.7.0_2.4_1609171023473.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_bem_xx_2.7.0_2.4_1609171023473.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_bem", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_bem", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.bem').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_bem|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Detect Generic PHI for Deidentification (Arabic)
author: John Snow Labs
name: ner_deid_generic
date: 2023-05-30
tags: [licensed, ner, clinical, deidentifiction, generic, arabic, ar]
task: Named Entity Recognition
language: ar
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Deidentification NER (Arabic) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 8 entities. This NER model is trained with a combination of custom datasets and several data augmentation mechanisms. This model uses Word2Vec Arabic Clinical Embeddings.
## Predicted Entities
`CONTACT`, `NAME`, `DATE`, `ID`, `SEX`, `LOCATION`, `PROFESSION`, `AGE`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_ar_4.4.2_3.0_1685443881012.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_ar_4.4.2_3.0_1685443881012.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner_subentity = MedicalNerModel.pretrained("ner_deid_generic", "ar", "clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipelineGeneric = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner_subentity,
ner_converter])
text = '''
ملاحظات سريرية - مريض الربو:
التاريخ: 30 مايو 2023
اسم المريض: أحمد سليمان
العنوان: شارع السلام، مبنى رقم 555، حي الصفاء، الرياض
الرمز البريدي: 54321
البلد: المملكة العربية السعودية
اسم المستشفى: مستشفى الأمانة
اسم الطبيب: د. ريم الحمد
تفاصيل الحالة:
المريض أحمد سليمان، البالغ من العمر 30 عامًا، يعاني من مرض الربو المزمن. يشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصه بمرض الربو بناءً على تاريخه الطبي واختبارات وظائف الرئة.
'''
data = spark.createDataFrame([[text]]).toDF("text")
results = nlpPipelineGeneric.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")
.setInputCols(Array("sentence","token"))
.setOutputCol("word_embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic", "ar", "clinical/models")
.setInputCols(Array("sentence","token","word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner,
ner_converter))
val text = """
ملاحظات سريرية - مريض الربو:
التاريخ: 30 مايو 2023
اسم المريض: أحمد سليمان
العنوان: شارع السلام، مبنى رقم 555، حي الصفاء، الرياض
الرمز البريدي: 54321
البلد: المملكة العربية السعودية
اسم المستشفى: مستشفى الأمانة
اسم الطبيب: د. ريم الحمد
تفاصيل الحالة:
المريض أحمد سليمان، البالغ من العمر 30 عامًا، يعاني من مرض الربو المزمن. يشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصه بمرض الربو بناءً على تاريخه الطبي واختبارات وظائف الرئة.
"""
val data = Seq(text).toDS.toDF("text")
val results = pipeline.fit(data).transform(data)
```
## Results
```bash
+-----------------+---------------------+
|chunk | ner_label |
+-----------------+---------------------+
|30 مايو |DATE |
|أحمد سليمان |NAME |
|الرياض |LOCATION |
|54321 |LOCATION |
|المملكة العربية |LOCATION |
|السعودية |LOCATION |
|مستشفى الأمانة |LOCATION |
|ريم الحمد |NAME |
|أحمد |NAME |
+-----------------+---------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_generic|
|Compatibility:|Healthcare NLP 4.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ar|
|Size:|15.0 MB|
## References
Custom John Snow Labs datasets
Data augmentation techniques
## Benchmarking
```bash
label tp fp fn total precision recall f1
CONTACT 146.0 0.0 6.0 152.0 1.0 0.9605 0.9799
NAME 685.0 25.0 25.0 710.0 0.9648 0.9648 0.9648
DATE 876.0 14.0 9.0 885.0 0.9843 0.9898 0.987
ID 28.0 9.0 2.0 30.0 0.7568 0.9333 0.8358
SEX 300.0 8.0 69.0 369.0 0.974 0.813 0.8863
LOCATION 689.0 48.0 38.0 727.0 0.9349 0.9477 0.9413
PROFESSION 303.0 20.0 32.0 335.0 0.9381 0.9045 0.921
AGE 608.0 7.0 9.0 617.0 0.9886 0.9854 0.987
macro - - - - - - 0.9378
micro - - - - - - 0.9572
```
---
layout: model
title: Clean documents pipeline for English
author: John Snow Labs
name: clean_stop
date: 2021-03-24
tags: [open_source, english, clean_stop, pipeline, en]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: en
nav_key: models
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The clean_stop pipeline is a pretrained pipeline that processes text with basic steps (sentence detection and tokenization) and removes stopwords.
It performs most of the common text cleaning tasks on your DataFrame.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/clean_stop_en_3.0.0_3.0_1616544492033.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/clean_stop_en_3.0.0_3.0_1616544492033.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('clean_stop', lang = 'en')
annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("clean_stop", lang = "en")
val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs ! "]
result_df = nlu.load('en.clean.stop').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | cleanTokens |
|---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:---------------------------------------|
| 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | ['Hello', 'John', 'Snow', 'Labs', '!'] |
```
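The `cleanTokens` column above is the result of stopword removal over the token stream. As a plain-Python illustration of that step (not the Spark NLP API; the stopword list here is a small hypothetical subset, while Spark NLP's `StopWordsCleaner` ships with a much larger default list):

```python
# Hypothetical mini stopword list for illustration only.
STOPWORDS = {"from", "the", "a", "an", "of", "and", "to", "in"}

def clean_tokens(tokens):
    # Keep every token whose lowercase form is not a stopword.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(clean_tokens(["Hello", "from", "John", "Snow", "Labs", "!"]))
# ['Hello', 'John', 'Snow', 'Labs', '!']
```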
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|clean_stop|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from alon-albalak)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_large_xquad
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-large-xquad` is an English model originally trained by `alon-albalak`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_xquad_en_4.0.0_3.0_1655996505419.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_xquad_en_4.0.0_3.0_1655996505419.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_large_xquad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlm_roberta_large_xquad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xquad.xlm_roberta.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_large_xquad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.8 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/alon-albalak/xlm-roberta-large-xquad
- https://github.com/deepmind/xquad
---
layout: model
title: Fast Neural Machine Translation Model from English to Chinese
author: John Snow Labs
name: opus_mt_en_zh
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, zh, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `zh`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_zh_xx_2.7.0_2.4_1609168259647.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_zh_xx_2.7.0_2.4_1609168259647.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_zh", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_zh", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.zh').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_zh|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011)
author: John Snow Labs
name: distilbert_token_classifier_autotrain_name_vsv_all_901529445
date: 2023-03-14
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-name_vsv_all-901529445` is an English model originally trained by `ismail-lucifer011`.
## Predicted Entities
`Name`, `OOV`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_vsv_all_901529445_en_4.3.1_3.0_1678783317887.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_vsv_all_901529445_en_4.3.1_3.0_1678783317887.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_vsv_all_901529445","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_vsv_all_901529445","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_autotrain_name_vsv_all_901529445|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ismail-lucifer011/autotrain-name_vsv_all-901529445
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from deepset)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_base_squad2_distilled
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-squad2-distilled` is an English model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_squad2_distilled_en_4.0.0_3.0_1655991460437.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_squad2_distilled_en_4.0.0_3.0_1655991460437.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_squad2_distilled","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlm_roberta_base_squad2_distilled","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.xlm_roberta.distilled_base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_base_squad2_distilled|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|854.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/deepset/xlm-roberta-base-squad2-distilled
- https://www.linkedin.com/company/deepset-ai/
- https://twitter.com/deepset_ai
- http://www.deepset.ai/jobs
- https://haystack.deepset.ai/community/join
- https://github.com/deepset-ai/haystack/
- https://github.com/deepset-ai/FARM
- https://deepset.ai/germanquad
- https://deepset.ai
- https://deepset.ai/german-bert
- https://github.com/deepset-ai/haystack/discussions
---
layout: model
title: English DistilBertForQuestionAnswering model (from vkrishnamoorthy)
author: John Snow Labs
name: distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `vkrishnamoorthy`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad_en_4.0.0_3.0_1654726595722.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad_en_4.0.0_3.0_1654726595722.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_vkrishnamoorthy").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/vkrishnamoorthy/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Pipeline to Detect Restaurant-related Terminology
author: John Snow Labs
name: nerdl_restaurant_100d_pipeline
date: 2022-03-18
tags: [restaurant, ner, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [nerdl_restaurant_100d](https://nlp.johnsnowlabs.com/2021/12/31/nerdl_restaurant_100d_en.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_RESTAURANT/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_RESTAURANT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/nerdl_restaurant_100d_pipeline_en_3.4.1_3.0_1647610686318.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/nerdl_restaurant_100d_pipeline_en_3.4.1_3.0_1647610686318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
restaurant_pipeline = PretrainedPipeline("nerdl_restaurant_100d_pipeline", lang = "en")
restaurant_pipeline.annotate("Hong Kong’s favourite pasta bar also offers one of the most reasonably priced lunch sets in town! With locations spread out all over the territory Sha Tin – Pici’s formidable lunch menu reads like a highlight reel of the restaurant. Choose from starters like the burrata and arugula salad or freshly tossed tuna tartare, and reliable handmade pasta dishes like pappardelle. Finally, round out your effortless Italian meal with a tidy one-pot tiramisu, of course, an espresso to power you through the rest of the day.")
```
```scala
val restaurant_pipeline = new PretrainedPipeline("nerdl_restaurant_100d_pipeline", lang = "en")
restaurant_pipeline.annotate("Hong Kong’s favourite pasta bar also offers one of the most reasonably priced lunch sets in town! With locations spread out all over the territory Sha Tin – Pici’s formidable lunch menu reads like a highlight reel of the restaurant. Choose from starters like the burrata and arugula salad or freshly tossed tuna tartare, and reliable handmade pasta dishes like pappardelle. Finally, round out your effortless Italian meal with a tidy one-pot tiramisu, of course, an espresso to power you through the rest of the day.")
```
## Results
```bash
+---------------------------+---------------+
|chunk |ner_label |
+---------------------------+---------------+
|Hong Kong’s |Restaurant_Name|
|favourite |Rating |
|pasta bar |Dish |
|most reasonably |Price |
|lunch |Hours |
|in town! |Location |
|Sha Tin – Pici’s |Restaurant_Name|
|burrata |Dish |
|arugula salad |Dish |
|freshly tossed tuna tartare|Dish |
|reliable |Price |
|handmade pasta |Dish |
|pappardelle |Dish |
|effortless |Amenity |
|Italian |Cuisine |
|tidy one-pot |Amenity |
|espresso |Dish |
+---------------------------+---------------+
```
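The chunk/label table above comes from the pipeline's NER output. As a minimal post-processing sketch (pure Python, with hypothetical chunk and label lists standing in for the pipeline's actual output), the same pipe-delimited layout can be rendered like this:

```python
def to_table(chunks, labels, chunk_w=27, label_w=15):
    """Render (chunk, ner_label) pairs in the pipe-delimited layout shown above."""
    sep = "+" + "-" * chunk_w + "+" + "-" * label_w + "+"
    lines = [sep,
             "|" + "chunk".ljust(chunk_w) + "|" + "ner_label".ljust(label_w) + "|",
             sep]
    for chunk, label in zip(chunks, labels):
        lines.append("|" + chunk.ljust(chunk_w) + "|" + label.ljust(label_w) + "|")
    lines.append(sep)
    return "\n".join(lines)

# Hypothetical extracted pairs, mirroring two rows of the Results table.
table = to_table(["burrata", "arugula salad"], ["Dish", "Dish"])
print(table)
```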
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|nerdl_restaurant_100d_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|166.7 MB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- NerDLModel
- NerConverter
- Finisher
---
layout: model
title: Social Determinants of Health
author: John Snow Labs
name: ner_sdoh
date: 2023-06-13
tags: [clinical, en, social_determinants, ner, public_health, sdoh, licensed]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This SDOH NER model is designed to detect and label social determinants of health (SDOH) entities within text data. Social determinants of health are crucial factors that influence individuals' health outcomes, encompassing various social, economic, and environmental elements.
The model has been trained using advanced machine learning techniques on a diverse range of text sources. It can accurately recognize and classify a wide range of SDOH entities, including but not limited to socioeconomic status, education level, housing conditions, access to healthcare services, employment status, cultural and ethnic background, neighborhood characteristics, and environmental factors. The model's accuracy and precision have been carefully validated against expert-labeled data to ensure reliable and consistent results. Here are the labels of the SDOH NER model with their descriptions:
- `Access_To_Care`: Patient’s ability or barriers to access the care needed. "long distances, access to health care, rehab program, etc."
- `Age`: All mention of ages including "Newborn, Infant, Child, Teenager, Teenage, Adult, etc."
- `Alcohol`: Mentions of alcohol drinking habit.
- `Chidhood_Event`: Childhood events mentioned by the patient. "childhood trauma, childhood abuse, etc."
- `Communicable_Disease`: Include all the communicable diseases. "HIV, hepatitis, tuberculosis, sexually transmitted diseases, etc."
- `Community_Safety`: Safety of the neighborhood or places of study or work. "dangerous neighborhood, safe area, etc."
- `Diet`: Information regarding the patient’s dietary habits. "vegetarian, vegan, healthy foods, low-calorie diet, etc."
- `Disability`: Mentions related to disability
- `Eating_Disorder`: This entity is used to extract eating disorders. "anorexia, bulimia, pica, etc."
- `Education`: Patient’s educational background.
- `Employment`: Patient or provider occupational titles.
- `Environmental_Condition`: Conditions of the environment where people live. "pollution, air quality, noisy environment, etc."
- `Exercise`: Mentions of the exercise habits of a patient. "exercise, physical activity, play football, go to the gym, etc."
- `Family_Member`: Nouns that refer to a family member. "mother, father, brother, sister, etc."
- `Financial_Status`: Financial status refers to the state and condition of the person’s finances. "financial decline, debt, bankruptcy, etc."
- `Food_Insecurity`: Food insecurity is defined as a lack of consistent access to enough food for every person in a household to live an active, healthy life. "food insecurity, scarcity of protein, lack of food, etc."
- `Gender`: Gender-specific nouns and pronouns
- `Geographic_Entity`: Geographical location refers to a specific physical point on Earth.
- `Healthcare_Institution`: Health care institution means every place, institution, building or agency. "hospital, clinic, trauma centers, etc."
- `Housing`: Conditions of the patient’s living spaces. "homeless, housing, small apartment, etc."
- `Hyperlipidemia`: Terms that indicate hyperlipidemia and relevant subtypes. "hyperlipidemia, hypercholesterolemia, elevated cholesterol, etc."
- `Hypertension`: Terms related to hypertension. "hypertension, high blood pressure, etc."
- `Income`: Information regarding the patient’s income
- `Insurance_Status`: Information regarding the patient’s insurance status. "uninsured, insured, Medicare, Medicaid, etc."
- `Language`: A system of conventional spoken, manual (signed) or written symbols by means of which human beings express themselves. "English, Spanish-speaking, bilingual, etc. "
- `Legal_Issues`: Issues that have legal implications. "legal issues, legal problems, detention , in prison, etc."
- `Marital_Status`: Terms that indicate the person’s marital status.
- `Mental_Health`: Include all the mental, neurodegenerative and neurodevelopmental diagnosis, disorders, conditions or syndromes mentioned. "depression, anxiety, bipolar disorder, psychosis, etc."
- `Obesity`: Terms related to the patient being obese. "obesity, overweight, etc."
- `Other_Disease`: Include all the diseases mentioned. "psoriasis, thromboembolism, etc."
- `Other_SDoH_Keywords`: This label is used to annotate terms or sentences that provide information about social determinants of health that are not already extracted under any other entity label. "minimal activities of daily living, lack of government programs, etc."
- `Population_Group`: The population group that a person belongs to, that does not fall under any other entities. "refugee, prison patient, etc."
- `Quality_Of_Life`: Quality of life refers to how an individual feels about their current station in life. " lower quality of life, profoundly impact his quality of life, etc."
- `Race_Ethnicity`: The race and ethnicity categories include racial, ethnic, and national origins.
- `Sexual_Activity`: Mentions of patient’s sexual behaviors. "monogamous, sexual activity, inconsistent condom use, etc."
- `Sexual_Orientation`: Terms that are related to sexual orientations. "gay, bisexual, heterosexual, etc."
- `Smoking`: Mentions of smoking habits. "smoking, cigarette, tobacco, etc."
- `Social_Exclusion`: Absence or lack of rights or accessibility to services or goods that are expected of the majority of the population. "social exclusion, social isolation, gender discrimination, etc."
- `Social_Support`: The presence of friends, family or other people to turn to for comfort or help. "social support, live with family, etc."
- `Spiritual_Beliefs`: Spirituality is concerned with beliefs beyond self, usually related to the existence of a superior being. "spiritual beliefs, religious beliefs, strong believer, etc."
- `Substance_Duration`: The duration associated with the health behaviors. "for 2 years, 3 months, etc"
- `Substance_Frequency`: The frequency associated with the health behaviors. "five days a week, daily, weekly, monthly, etc"
- `Substance_Quantity`: The quantity associated with the health behaviors. "2 packs, 40 ounces, ten to twelve, moderate, etc."
- `Substance_Use`: Mentions of illegal recreational drug use. Also includes substances that can create dependency, such as caffeine and tea. "overdose, cocaine, illicit substance intoxication, coffee, etc."
- `Transportation`: Mentions of access to means of transportation. "car, bus, train, etc."
- `Violence_Or_Abuse`: Episodes of abuse or violence experienced and reported by the patient. "domestic violence, sexual abuse, etc."
## Predicted Entities
`Access_To_Care`, `Age`, `Alcohol`, `Chidhood_Event`, `Communicable_Disease`, `Community_Safety`, `Diet`, `Disability`, `Eating_Disorder`, `Education`, `Employment`, `Environmental_Condition`, `Exercise`, `Family_Member`, `Financial_Status`, `Food_Insecurity`, `Gender`, `Geographic_Entity`, `Healthcare_Institution`, `Housing`, `Hyperlipidemia`, `Hypertension`, `Income`, `Insurance_Status`, `Language`, `Legal_Issues`, `Marital_Status`, `Mental_Health`, `Obesity`, `Other_Disease`, `Other_SDoH_Keywords`, `Population_Group`, `Quality_Of_Life`, `Race_Ethnicity`, `Sexual_Activity`, `Sexual_Orientation`, `Smoking`, `Social_Exclusion`, `Social_Support`, `Spiritual_Beliefs`, `Substance_Duration`, `Substance_Frequency`, `Substance_Quantity`, `Substance_Use`, `Transportation`, `Violence_Or_Abuse`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_NER/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_en_4.4.3_3.0_1686654976160.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_en_4.4.3_3.0_1686654976160.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_sdoh", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
clinical_embeddings,
ner_model,
ner_converter
])
sample_texts = [["""Smith is 55 years old, living in New York, a divorced Mexican American woman with financial problems. She speaks Spanish and Portuguese. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and cannot access health insurance or paid sick leave. She has a son, a student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reports having her catholic faith as a means of support as well. She has a long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. She had DUI in April and was due to court this week."""]]
data = spark.createDataFrame(sample_texts).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_sdoh", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
clinical_embeddings,
ner_model,
ner_converter
))
val data = Seq("""Smith is 55 years old, living in New York, a divorced Mexican American woman with financial problems. She speaks Spanish and Portuguese. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and cannot access health insurance or paid sick leave. She has a son, a student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reports having her catholic faith as a means of support as well. She has a long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. She had DUI in April and was due to court this week.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
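The `result` DataFrame carries one `ner_chunk` annotation per detected entity. As a sketch of typical downstream post-processing (pure Python, on hypothetical (chunk, label) pairs rather than actual model output), extracted mentions are often grouped by entity type:

```python
from collections import defaultdict

# Hypothetical (chunk, label) pairs, as one would read them off the
# ner_chunk column for the sample text above; not actual model output.
pairs = [
    ("55 years old", "Age"),
    ("New York", "Geographic_Entity"),
    ("divorced", "Marital_Status"),
    ("Spanish", "Language"),
    ("Portuguese", "Language"),
    ("depression", "Mental_Health"),
]

# Group chunk texts under their SDOH entity label.
by_label = defaultdict(list)
for chunk, label in pairs:
    by_label[label].append(chunk)
```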
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_FardinSaboori_bert_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_FardinSaboori_bert_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.by_FardinSaboori").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
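In the NLU one-liner above, the question and context are packed into a single string separated by `|||` (the separator convention is assumed from the example). A small helper for building and splitting that format:

```python
SEP = "|||"  # separator between question and context, as in the NLU example

def pack(question, context):
    """Join a question and its context into one NLU-style input string."""
    return f"{question}{SEP}{context}"

def unpack(packed):
    """Split an NLU-style input string back into (question, context)."""
    question, _, context = packed.partition(SEP)
    return question, context

q, c = unpack(pack("What's my name?", "My name is Clara and I live in Berkeley."))
```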
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_FardinSaboori_bert_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/FardinSaboori/bert-finetuned-squad
---
layout: model
title: Bangla Bert Embeddings (from Kowsher)
author: John Snow Labs
name: bert_embeddings_bangla_bert
date: 2022-04-11
tags: [bert, embeddings, bn, open_source]
task: Embeddings
language: bn
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bangla-bert` is a Bangla model originally trained by `Kowsher`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bangla_bert_bn_3.4.2_3.0_1649673360956.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bangla_bert_bn_3.4.2_3.0_1649673360956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bangla_bert","bn") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["আমি স্পার্ক এনএলপি ভালোবাসি"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bangla_bert","bn")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("আমি স্পার্ক এনএলপি ভালোবাসি").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("bn.embed.bangla_bert").predict("""আমি স্পার্ক এনএলপি ভালোবাসি""")
```
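Each token in `result` carries a dense BERT vector (768 dimensions for a base-sized model; an assumption here, not stated above). Token similarity is then typically computed with cosine similarity, sketched below on toy vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel toy vectors have similarity 1.0.
sim = cosine([1.0, 0.0, 2.0], [2.0, 0.0, 4.0])
```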
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bangla_bert|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|bn|
|Size:|615.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Kowsher/bangla-bert
- https://github.com/Kowsher/bert-base-bangla
- https://arxiv.org/abs/1810.04805
- https://github.com/google-research/bert
- https://www.kaggle.com/gakowsher/bangla-language-model-dataset
- https://ssrn.com/abstract=
- http://kowsher.org/
---
layout: model
title: Translate Central Bikol to English Pipeline
author: John Snow Labs
name: translate_bcl_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, bcl, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `bcl`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_bcl_en_xx_2.7.0_2.4_1609688692603.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_bcl_en_xx_2.7.0_2.4_1609688692603.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_bcl_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_bcl_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.bcl.translate_to.en').predict(text, output_level='sentence')
translate_df
```
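Because translation cost grows with sequence length (see the note above), long documents are often split into sentence-sized pieces before calling `predict`. A naive punctuation-based splitter for illustration only; a production pipeline would use a proper sentence detector:

```python
import re

def split_sentences(text):
    """Naively split text at whitespace following ., ! or ? (illustration only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

batch = split_sentences("First sentence. Second one! A third?")
```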
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_bcl_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Translate English to Seychellois Creole Pipeline
author: John Snow Labs
name: translate_en_crs
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, crs, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `crs`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_crs_xx_2.7.0_2.4_1609691811394.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_crs_xx_2.7.0_2.4_1609691811394.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_crs", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_crs", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.crs').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_crs|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering model (from Wiam)
author: John Snow Labs
name: distilbert_qa_Wiam_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Wiam`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Wiam_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724868151.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Wiam_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724868151.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Wiam_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Wiam_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Wiam").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_Wiam_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Wiam/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English RobertaForQuestionAnswering (from deepset)
author: John Snow Labs
name: roberta_qa_roberta_base_squad2_distilled
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-distilled` is an English model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_distilled_en_4.0.0_3.0_1655735282920.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_distilled_en_4.0.0_3.0_1655735282920.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad2_distilled","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_squad2_distilled","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.distilled_base.by_deepset").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_squad2_distilled|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|463.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/deepset/roberta-base-squad2-distilled
- https://www.linkedin.com/company/deepset-ai/
- https://haystack.deepset.ai/community/join
- https://github.com/deepset-ai/FARM
- http://www.deepset.ai/jobs
- https://twitter.com/deepset_ai
- https://github.com/deepset-ai/haystack/discussions
- https://github.com/deepset-ai/haystack/
- https://deepset.ai
- https://deepset.ai/germanquad
- https://deepset.ai/german-bert
---
layout: model
title: Mapping Entities (Clinical Drugs) with Corresponding UMLS CUI Codes
author: John Snow Labs
name: umls_clinical_drugs_mapper
date: 2022-07-06
tags: [umls, chunk_mapper, clinical, licensed, en]
task: Chunk Mapping
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained model maps entities (Clinical Drugs) with their corresponding UMLS CUI codes.
## Predicted Entities
`umls_code`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_clinical_drugs_mapper_en_4.0.0_3.0_1657124255341.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_clinical_drugs_mapper_en_4.0.0_3.0_1657124255341.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("clinical_ner")
ner_model_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "clinical_ner"])\
.setOutputCol("ner_chunk")
chunkerMapper = ChunkMapperModel.pretrained("umls_clinical_drugs_mapper", "en", "clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")\
.setRels(["umls_code"])\
.setLowerCase(True)
mapper_pipeline = Pipeline().setStages([
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_model,
ner_model_converter,
chunkerMapper])
sample_text="""She was immediately given hydrogen peroxide 30 mg, and has been advised Neosporin Cream for 5 days.
She has a history of taking magnesium hydroxide 100mg/1ml and metformin 1000 mg."""
test_data = spark.createDataFrame([[sample_text]]).toDF("text")
result = mapper_pipeline.fit(test_data).transform(test_data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel
.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("clinical_ner")
val ner_model_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "clinical_ner"))
.setOutputCol("ner_chunk")
val chunkerMapper = ChunkMapperModel
.pretrained("umls_clinical_drugs_mapper", "en", "clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("mappings")
.setRels(Array("umls_code"))
val mapper_pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_model,
ner_model_converter,
chunkerMapper))
val test_data = Seq("She was immediately given hydrogen peroxide 30 mg, and has been advised Neosporin Cream for 5 days. She has a history of taking magnesium hydroxide 100mg/1ml and metformin 1000 mg.").toDF("text")
val result = mapper_pipeline.fit(test_data).transform(test_data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.umls_clinical_drugs_mapper").predict("""She was immediately given hydrogen peroxide 30 mg, and has been advised Neosporin Cream for 5 days.
She has a history of taking magnesium hydroxide 100mg/1ml and metformin 1000 mg.""")
```
## Results
```bash
+-------------------+---------+
|ner_chunk |umls_code|
+-------------------+---------+
|hydrogen peroxide |C0020281 |
|Neosporin Cream |C0132149 |
|magnesium hydroxide|C0024476 |
|metformin |C0025598 |
+-------------------+---------+
```
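Once the `ner_chunk` and `umls_code` result lists are collected (e.g. via `result.select(...).collect()`), the chunk-to-code pairs above can be flattened with plain Python; a minimal sketch using the illustrative values from the Results table (the lists below are hand-copied, not produced by the Spark NLP API):

```python
# Pair each detected drug chunk with its mapped UMLS CUI code.
# Values are the illustrative ones from the Results table above.
chunks = ["hydrogen peroxide", "Neosporin Cream", "magnesium hydroxide", "metformin"]
codes = ["C0020281", "C0132149", "C0024476", "C0025598"]

mapping = dict(zip(chunks, codes))
for chunk, code in mapping.items():
    print(f"{chunk:<20} -> {code}")
```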
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|umls_clinical_drugs_mapper|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|23.3 MB|
## References
2022AA UMLS dataset’s Clinical Drug category. https://www.nlm.nih.gov/research/umls/index.html
---
layout: model
title: Detect Clinical Entities (ner_eu_clinical_case - fr)
author: John Snow Labs
name: ner_eu_clinical_case
date: 2023-02-01
tags: [fr, clinical, licensed, ner]
task: Named Entity Recognition
language: fr
edition: Healthcare NLP 4.2.8
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition (NER) deep learning model for extracting clinical entities from French texts. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
## Predicted Entities
`clinical_event`, `bodypart`, `clinical_condition`, `units_measurements`, `patient`, `date_time`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_fr_4.2.8_3.0_1675293960896.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_fr_4.2.8_3.0_1675293960896.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fr")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained('ner_eu_clinical_case', "fr", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["""Un garçon de 3 ans atteint d'un trouble autistique à l'hôpital du service pédiatrique A de l'hôpital universitaire. Il n'a pas d'antécédents familiaux de troubles ou de maladies du spectre autistique. Le garçon a été diagnostiqué avec un trouble de communication sévère, avec des difficultés d'interaction sociale et un traitement sensoriel retardé. Les tests sanguins étaient normaux (thyréostimuline (TSH), hémoglobine, volume globulaire moyen (MCV) et ferritine). L'endoscopie haute a également montré une tumeur sous-muqueuse provoquant une obstruction subtotale de la sortie gastrique. Devant la suspicion d'une tumeur stromale gastro-intestinale, une gastrectomie distale a été réalisée. L'examen histopathologique a révélé une prolifération de cellules fusiformes dans la couche sous-muqueuse."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fr")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_eu_clinical_case", "fr", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter))
val data = Seq("""Un garçon de 3 ans atteint d'un trouble autistique à l'hôpital du service pédiatrique A de l'hôpital universitaire. Il n'a pas d'antécédents familiaux de troubles ou de maladies du spectre autistique. Le garçon a été diagnostiqué avec un trouble de communication sévère, avec des difficultés d'interaction sociale et un traitement sensoriel retardé. Les tests sanguins étaient normaux (thyréostimuline (TSH), hémoglobine, volume globulaire moyen (MCV) et ferritine). L'endoscopie haute a également montré une tumeur sous-muqueuse provoquant une obstruction subtotale de la sortie gastrique. Devant la suspicion d'une tumeur stromale gastro-intestinale, une gastrectomie distale a été réalisée. L'examen histopathologique a révélé une prolifération de cellules fusiformes dans la couche sous-muqueuse.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+-----------------------------------------------------+------------------+
|chunk |ner_label |
+-----------------------------------------------------+------------------+
|Un garçon de 3 ans |patient |
|trouble autistique à l'hôpital du service pédiatrique|clinical_condition|
|l'hôpital |clinical_event |
|Il n'a |patient |
|d'antécédents |clinical_event |
|troubles |clinical_condition|
|maladies |clinical_condition|
|du spectre autistique |bodypart |
|Le garçon |patient |
|diagnostiqué |clinical_event |
|trouble |clinical_condition|
|difficultés |clinical_event |
|traitement |clinical_event |
|tests |clinical_event |
|normaux |units_measurements|
|thyréostimuline |clinical_event |
|TSH |clinical_event |
|ferritine |clinical_event |
|L'endoscopie |clinical_event |
|montré |clinical_event |
|tumeur sous-muqueuse |clinical_condition|
|provoquant |clinical_event |
|obstruction |clinical_condition|
|la sortie gastrique |bodypart |
|suspicion |clinical_event |
|tumeur stromale gastro-intestinale |clinical_condition|
|gastrectomie |clinical_event |
|L'examen |clinical_event |
|révélé |clinical_event |
|prolifération |clinical_event |
|cellules fusiformes |bodypart |
|la couche sous-muqueuse |bodypart |
+-----------------------------------------------------+------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_eu_clinical_case|
|Compatibility:|Healthcare NLP 4.2.8+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|fr|
|Size:|895.0 KB|
## References
The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
## Benchmarking
```bash
label tp fp fn total precision recall f1
date_time 49.0 14.0 70.0 104.0 0.7778 0.7000 0.7368
units_measurements 92.0 19.0 6.0 48.0 0.8288 0.9388 0.8804
clinical_condition 178.0 74.0 73.0 120.0 0.7063 0.7092 0.7078
patient 114.0 6.0 15.0 87.0 0.9500 0.8837 0.9157
clinical_event 265.0 81.0 71.0 478.0 0.7659 0.7887 0.7771
bodypart 243.0 34.0 64.0 166.0 0.8773 0.7915 0.8322
macro - - - - - - 0.8083
micro - - - - - - 0.7978
```
---
layout: model
title: English DistilBertForQuestionAnswering model (from 21iridescent)
author: John Snow Labs
name: distilbert_qa_21iridescent_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `21iridescent`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_21iridescent_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724023073.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_21iridescent_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724023073.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_21iridescent_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_21iridescent_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_21iridescent").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_21iridescent_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/21iridescent/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English AlbertForQuestionAnswering model (from AyushPJ)
author: John Snow Labs
name: albert_qa_ai_club_inductions_21_nlp
date: 2022-06-24
tags: [en, open_source, albert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: AlBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ai-club-inductions-21-nlp-ALBERT` is an English model originally trained by `AyushPJ`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_ai_club_inductions_21_nlp_en_4.0.0_3.0_1656063682959.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_ai_club_inductions_21_nlp_en_4.0.0_3.0_1656063682959.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_ai_club_inductions_21_nlp","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_ai_club_inductions_21_nlp","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.albert.by_AyushPJ").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_qa_ai_club_inductions_21_nlp|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|42.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AyushPJ/ai-club-inductions-21-nlp-ALBERT
---
layout: model
title: Spanish RobertaForQuestionAnswering (from mrm8488)
author: John Snow Labs
name: roberta_qa_RuPERTa_base_finetuned_squadv1
date: 2022-06-20
tags: [es, open_source, question_answering, roberta]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RuPERTa-base-finetuned-squadv1` is a Spanish model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_RuPERTa_base_finetuned_squadv1_es_4.0.0_3.0_1655727321165.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_RuPERTa_base_finetuned_squadv1_es_4.0.0_3.0_1655727321165.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_RuPERTa_base_finetuned_squadv1","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_RuPERTa_base_finetuned_squadv1","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.squad.ruperta.base.by_mrm8488").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_RuPERTa_base_finetuned_squadv1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|470.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/mrm8488/RuPERTa-base-finetuned-squadv1
---
layout: model
title: Word2Vec Embeddings in Spanish (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, es, open_source]
task: Embeddings
language: es
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
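Conceptually, a word-embeddings lookup is a token-to-vector table with case folding (this model is not case sensitive) and a default vector for out-of-vocabulary tokens. A toy plain-Python sketch with made-up 4-dimensional vectors (the real `w2v_cc_300d` table holds 300-dimensional vectors):

```python
# Toy embeddings lookup: dummy 4-d vectors, illustrative values only.
DIM = 4
table = {
    "me": [0.1] * DIM,
    "encanta": [0.2] * DIM,
}

def embed(token):
    # Case-insensitive lookup; unknown tokens map to the zero vector.
    return table.get(token.lower(), [0.0] * DIM)

print(embed("Encanta"))  # [0.2, 0.2, 0.2, 0.2]
print(embed("NLP"))      # [0.0, 0.0, 0.0, 0.0]
```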
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_es_3.4.1_3.0_1647459363492.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_es_3.4.1_3.0_1647459363492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","es") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Me encanta Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","es")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Me encanta Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.embed.w2v_cc_300d").predict("""Me encanta Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|es|
|Size:|1.3 GB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Sentiment Analysis of French texts
author: John Snow Labs
name: classifierdl_bert_sentiment
date: 2021-09-08
tags: [fr, sentiment, classification, open_source]
task: Sentiment Analysis
language: fr
edition: Spark NLP 3.2.0
spark_version: 2.4
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model identifies the sentiments (positive or negative) in French texts.
## Predicted Entities
`NEGATIVE`, `POSITIVE`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_FR/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_Fr_Sentiment.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_bert_sentiment_fr_3.2.0_2.4_1631104713514.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_bert_sentiment_fr_3.2.0_2.4_1631104713514.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
embeddings = BertSentenceEmbeddings\
.pretrained('labse', 'xx') \
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_bert_sentiment", "fr") \
.setInputCols(["document", "sentence_embeddings"]) \
.setOutputCol("class")
fr_sentiment_pipeline = Pipeline(stages=[document, embeddings, sentimentClassifier])
light_pipeline = LightPipeline(fr_sentiment_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
result1 = light_pipeline.annotate("Mignolet vraiment dommage de ne jamais le voir comme titulaire")
result2 = light_pipeline.annotate("Je me sens bien, je suis heureux d'être de retour.")
print(result1["class"], result2["class"], sep = "\n")
```
```scala
val document = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val embeddings = BertSentenceEmbeddings
.pretrained("labse", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence_embeddings")
val sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_bert_sentiment", "fr")
.setInputCols(Array("document", "sentence_embeddings"))
.setOutputCol("class")
val fr_sentiment_pipeline = new Pipeline().setStages(Array(document, embeddings, sentimentClassifier))
val light_pipeline = new LightPipeline(fr_sentiment_pipeline.fit(Seq("").toDF("text")))
val result1 = light_pipeline.annotate("Mignolet vraiment dommage de ne jamais le voir comme titulaire")
val result2 = light_pipeline.annotate("Je me sens bien, je suis heureux d'être de retour.")
```
{:.nlu-block}
```python
import nlu
nlu.load("fr.classify.sentiment.bert").predict("""Mignolet vraiment dommage de ne jamais le voir comme titulaire""")
```
## Results
```bash
['NEGATIVE']
['POSITIVE']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|classifierdl_bert_sentiment|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|fr|
## Data Source
https://github.com/charlesmalafosse/open-dataset-for-sentiment-analysis/
## Benchmarking
```bash
precision recall f1-score support
NEGATIVE 0.82 0.72 0.77 378
POSITIVE 0.92 0.95 0.94 1240
accuracy 0.90 1618
macro avg 0.87 0.84 0.85 1618
weighted avg 0.90 0.90 0.90 1618
```
---
layout: model
title: Legal Confidential Clause Binary Classifier
author: John Snow Labs
name: legclf_confidential_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `confidential` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you add.
## Predicted Entities
`other`, `confidential`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_confidential_clause_en_1.0.0_3.2_1660122270121.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_confidential_clause_en_1.0.0_3.2_1660122270121.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
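This card was published without a usage snippet. Following the pattern of sibling `legclf_*` cards (document assembler, sentence embeddings, then the classifier reading `sentence_embeddings` and writing `category`, matching the Model Information labels below), a plausible sketch is shown here. The sentence-embeddings model name `sent_bert_base_cased` is an assumption based on comparable legal classifier cards and may differ for this model:

```python
# Hedged sketch, not an official snippet: a typical legclf_* pipeline.
# "sent_bert_base_cased" is an assumed embeddings model name.
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

doc_classifier = ClassifierDLModel.pretrained("legclf_confidential_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```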
## Results
```bash
+--------------+
|        result|
+--------------+
|[confidential]|
|       [other]|
|       [other]|
|[confidential]|
+--------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_confidential_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
confidential 0.90 0.90 0.90 41
other 0.97 0.97 0.97 127
accuracy - - 0.95 168
macro-avg 0.94 0.94 0.94 168
weighted-avg 0.95 0.95 0.95 168
```
---
layout: model
title: Stop Words Cleaner for Sesotho
author: John Snow Labs
name: stopwords_st
date: 2020-07-14 19:03:00 +0800
task: Stop Words Removal
language: st
edition: Spark NLP 2.5.4
spark_version: 2.4
tags: [stopwords, st]
supported: true
annotator: StopWordsCleaner
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
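As a concept illustration (plain Python, not the Spark NLP annotator), stop-word removal is a simple membership filter over tokens. The tiny Sesotho stop-word list below is hypothetical, chosen only to mirror the tokens dropped in the Results section of this card:

```python
# Toy Sesotho stop-word filter (illustrative word list, not the model's).
stop_words = {"le", "ho", "ba", "oa", "ke", "ea"}

tokens = "Ntle le ho ba morena oa leboea".split()
clean_tokens = [t for t in tokens if t.lower() not in stop_words]
print(clean_tokens)  # ['Ntle', 'morena', 'leboea']
```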
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_st_st_2.5.4_2.4_1594742438831.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_st_st_2.5.4_2.4_1594742438831.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
stop_words = StopWordsCleaner.pretrained("stopwords_st", "st") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Ntle le ho ba morena oa leboea, John Snow ke ngaka ea Lenyesemane ebile ke moetapele nts'etsopele ea anesthesia le bohloeki ba bongaka.")
```
```scala
...
val stopWords = StopWordsCleaner.pretrained("stopwords_st", "st")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords))
val data = Seq("Ntle le ho ba morena oa leboea, John Snow ke ngaka ea Lenyesemane ebile ke moetapele nts'etsopele ea anesthesia le bohloeki ba bongaka.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Ntle le ho ba morena oa leboea, John Snow ke ngaka ea Lenyesemane ebile ke moetapele nts'etsopele ea anesthesia le bohloeki ba bongaka."""]
stopword_df = nlu.load('st.stopwords').predict(text)
stopword_df[['cleanTokens']]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=3, result='Ntle', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=14, end=19, result='morena', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=24, end=29, result='leboea', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=30, end=30, result=',', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=32, end=35, result='John', metadata={'sentence': '0'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_st|
|Type:|stopwords|
|Compatibility:|Spark NLP 2.5.4+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|st|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)
---
layout: model
title: English Bert Embeddings Uncased model (from Tristan)
author: John Snow Labs
name: bert_embeddings_olm_base_uncased_oct_2022
date: 2023-02-21
tags: [open_source, bert, bert_embeddings, bertformaskedlm, en, tensorflow]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `olm-bert-base-uncased-oct-2022` is an English model originally trained by `Tristan`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_olm_base_uncased_oct_2022_en_4.3.0_3.0_1676999449577.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_olm_base_uncased_oct_2022_en_4.3.0_3.0_1676999449577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_olm_base_uncased_oct_2022","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_olm_base_uncased_oct_2022","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_olm_base_uncased_oct_2022|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|467.5 MB|
|Case sensitive:|true|
## References
https://huggingface.co/Tristan/olm-bert-base-uncased-oct-2022
---
layout: model
title: Detect PHI for Deidentification in Romanian (BERT)
author: John Snow Labs
name: ner_deid_subentity_bert
date: 2022-06-27
tags: [deidentification, bert, phi, ner, ro, licensed]
task: Named Entity Recognition
language: ro
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Named Entity Recognition annotators allow for a generic model to be trained using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Deidentification NER (Romanian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It is trained with `bert_base_cased` embeddings and can detect 17 entities.
This NER model is trained with a combination of custom datasets with several data augmentation mechanisms.
## Predicted Entities
`AGE`, `CITY`, `COUNTRY`, `DATE`, `DOCTOR`, `EMAIL`, `FAX`, `HOSPITAL`, `IDNUM`, `LOCATION-OTHER`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `PROFESSION`, `STREET`, `ZIP`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_bert_ro_4.0.0_3.0_1656311815383.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_bert_ro_4.0.0_3.0_1656311815383.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\
.setInputCols(["sentence","token"])\
.setOutputCol("word_embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_bert", "ro", "clinical/models")\
.setInputCols(["sentence","token","word_embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner,
ner_converter])
text = """
Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""
data = spark.createDataFrame([[text]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")
.setInputCols(Array("sentence","token"))
.setOutputCol("word_embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_bert", "ro", "clinical/models")
.setInputCols(Array("sentence","token","word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner,
ner_converter))
val text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""
val data = Seq(text).toDS.toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ro.med_ner.deid.subentity.bert").predict("""
Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401""")
```
## Results
```bash
+----------------------------+---------+
|chunk |ner_label|
+----------------------------+---------+
|Spitalul Pentru Ochi de Deal|HOSPITAL |
|Drumul Oprea Nr |STREET |
|Vaslui |CITY |
|737405 |ZIP |
|+40(235)413773 |PHONE |
|25 May 2022 |DATE |
|BUREAN MARIA |PATIENT |
|77 |AGE |
|Agota Evelyn Tımar |DOCTOR |
|2450502264401 |IDNUM |
+----------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_subentity_bert|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ro|
|Size:|16.5 MB|
## References
- Custom John Snow Labs datasets
- Data augmentation techniques
## Benchmarking
```bash
label precision recall f1-score support
AGE 0.98 0.95 0.96 1186
CITY 0.94 0.87 0.90 299
COUNTRY 0.90 0.73 0.81 108
DATE 0.98 0.95 0.96 4518
DOCTOR 0.91 0.94 0.93 1979
EMAIL 1.00 0.62 0.77 8
FAX 0.98 0.95 0.96 56
HOSPITAL 0.92 0.85 0.88 881
IDNUM 0.98 0.99 0.98 235
LOCATION-OTHER 1.00 0.85 0.92 13
MEDICALRECORD 0.99 1.00 1.00 444
ORGANIZATION 0.86 0.76 0.81 75
PATIENT 0.91 0.87 0.89 937
PHONE 0.96 0.98 0.97 302
PROFESSION 0.85 0.82 0.83 161
STREET 0.96 0.94 0.95 173
ZIP 0.99 0.98 0.99 138
micro-avg 0.95 0.93 0.94 11513
macro-avg 0.95 0.89 0.91 11513
weighted-avg 0.95 0.93 0.94 11513
```
---
layout: model
title: English BertForQuestionAnswering model (from SauravMaheshkar)
author: John Snow Labs
name: bert_qa_bert_base_cased_chaii
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-chaii` is an English model originally trained by `SauravMaheshkar`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_cased_chaii_en_4.0.0_3.0_1654179712101.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_cased_chaii_en_4.0.0_3.0_1654179712101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_cased_chaii","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_cased_chaii","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.chaii.bert.base_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
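As the snippet above suggests, the nlu binding appears to pass the question and context as one string separated by `|||`. A small sketch of how such a payload can be split into its two parts (the helper below is illustrative, not part of nlu's API):

```python
def split_qa_payload(payload, sep="|||"):
    """Split a 'question|||context' string into (question, context)."""
    question, _, context = payload.partition(sep)
    return question.strip(), context.strip()

question, context = split_qa_payload(
    "What's my name?|||My name is Clara and I live in Berkeley."
)
print(question)  # What's my name?
```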
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_cased_chaii|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/SauravMaheshkar/bert-base-cased-chaii
---
layout: model
title: Legal Arguments Mining in Court Decisions
author: John Snow Labs
name: legclf_argument_mining
date: 2023-03-26
tags: [en, classification, licensed, legal, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a multiclass classification model that classifies arguments in legal discourse into the following classes: `subsumption`, `definition`, `conclusion`, `other`.
## Predicted Entities
`subsumption`, `definition`, `conclusion`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_argument_mining_en_1.0.0_3.0_1679829561976.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_argument_mining_en_1.0.0_3.0_1679829561976.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")\
.setInputCols(["document", "token"])\
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)
embeddingsSentence = nlp.SentenceEmbeddings()\
.setInputCols(["document", "embeddings"])\
.setOutputCol("sentence_embeddings")\
.setPoolingStrategy("AVERAGE")
docClassifier = legal.ClassifierDLModel.pretrained("legclf_argument_mining","en", "legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
tokenizer,
embeddings,
embeddingsSentence,
docClassifier
])
df = spark.createDataFrame([["There is therefore no doubt – and the Government do not contest – that the measures concerned in the present case ( the children 's continued placement in foster homes and the restrictions imposed on contact between the applicants and their children ) amounts to an “ interference ” with the applicants ' rights to respect for their family life ."]]).toDF("text")
model = nlpPipeline.fit(df)
result = model.transform(df)
result.select("text", "category.result").show()
```
## Results
```bash
+--------------------+-------------+
| text| result|
+--------------------+-------------+
|There is therefor...|[subsumption]|
+--------------------+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_argument_mining|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.2 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/MeilingShi/legal_argument_mining)
## Benchmarking
```bash
label precision recall f1-score support
conclusion 0.93 0.79 0.85 52
definition 0.87 0.81 0.84 58
other 0.88 0.88 0.88 57
subsumption 0.64 0.79 0.71 52
accuracy - - 0.82 219
macro-avg 0.83 0.82 0.82 219
weighted-avg 0.83 0.82 0.82 219
```
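As a quick sanity check on the table above, the macro-averaged F1 is the unweighted mean of the four per-class F1 scores:

```python
# Per-class F1 scores copied from the benchmarking table above.
f1_scores = {"conclusion": 0.85, "definition": 0.84, "other": 0.88, "subsumption": 0.71}

# Macro average: unweighted mean across classes.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print(round(macro_f1, 2))  # 0.82
```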
---
layout: model
title: Chinese BertForMaskedLM Cased model (from hfl)
author: John Snow Labs
name: bert_embeddings_hfl_chinese_roberta_wwm_ext
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-roberta-wwm-ext` is a Chinese model originally trained by `hfl`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_hfl_chinese_roberta_wwm_ext_zh_4.2.4_3.0_1670021322707.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_hfl_chinese_roberta_wwm_ext_zh_4.2.4_3.0_1670021322707.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_hfl_chinese_roberta_wwm_ext","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_hfl_chinese_roberta_wwm_ext","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_hfl_chinese_roberta_wwm_ext|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|383.7 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/hfl/chinese-roberta-wwm-ext
- https://arxiv.org/abs/1906.08101
- https://github.com/google-research/bert
- https://github.com/ymcui/Chinese-BERT-wwm
- https://github.com/ymcui/MacBERT
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/ymcui/HFL-Anthology
- https://arxiv.org/abs/2004.13922
---
layout: model
title: English BertForQuestionAnswering model (from peterhsu)
author: John Snow Labs
name: bert_qa_peterhsu_bert_finetuned_squad_accelerate
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-accelerate` is an English model originally trained by `peterhsu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_peterhsu_bert_finetuned_squad_accelerate_en_4.0.0_3.0_1654535878232.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_peterhsu_bert_finetuned_squad_accelerate_en_4.0.0_3.0_1654535878232.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_peterhsu_bert_finetuned_squad_accelerate","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_peterhsu_bert_finetuned_squad_accelerate","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.accelerate.by_peterhsu").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_peterhsu_bert_finetuned_squad_accelerate|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/peterhsu/bert-finetuned-squad-accelerate
---
layout: model
title: Fast Neural Machine Translation Model from Estonian to English
author: John Snow Labs
name: opus_mt_et_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, et, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `et`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_et_en_xx_2.7.0_2.4_1609170283601.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_et_en_xx_2.7.0_2.4_1609170283601.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_et_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_et_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.et.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_et_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English asr_wav2vec2_base_timit_moaiz_exp2 TFWav2Vec2ForCTC from moaiz237
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_moaiz_exp2
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_moaiz_exp2` is an English model originally trained by moaiz237.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_moaiz_exp2_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_moaiz_exp2_en_4.2.0_3.0_1664037629984.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_moaiz_exp2_en_4.2.0_3.0_1664037629984.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_moaiz_exp2', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_moaiz_exp2", lang = "en")
val annotations = pipeline.transform(audioDF)
```
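Both snippets assume an `audioDF` whose rows carry the raw audio as an array of floats (Wav2Vec2 models expect 16 kHz mono input). A stdlib-only sketch of turning a 16-bit PCM WAV file into such a float array is shown below; the helper name is illustrative, and in practice you would wrap the resulting arrays with `spark.createDataFrame` to build the `audio_content` column:

```python
import struct
import wave

def read_wav_as_floats(path):
    """Read a 16-bit PCM mono WAV file into floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM samples"
        frames = wf.readframes(wf.getnframes())
    # Each sample is a little-endian signed 16-bit integer.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```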
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_moaiz_exp2|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Legal NER for MAPA(Multilingual Anonymisation for Public Administrations)
author: John Snow Labs
name: legner_mapa
date: 2023-04-28
tags: [ga, licensed, ner, legal, mapa]
task: Named Entity Recognition
language: ga
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union.
This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Irish` documents.
## Predicted Entities
`ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_ga_1.0.0_3.0_1682670223837.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_ga_1.0.0_3.0_1682670223837.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_irish_legal","gle")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)\
.setCaseSensitive(True)
ner_model = legal.NerModel.pretrained("legner_mapa", "ga", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""Dhiúltaigh Tribunale di Teramo ( An Chúirt Dúiche, Teramo ) an t-iarratas a rinne Bn.Grigorescu, ar bhonn teagmhasach, chun aitheantas a thabhairt san Iodáil do bhreithiúnas colscartha Tribunalul București ( An Chúirt Réigiúnach, Búcairist ) an 3 Nollaig 2012, de bhun Rialachán Uimh."""]
result = model.transform(spark.createDataFrame([text]).toDF("text"))
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|Teramo |ADDRESS |
|Bn.Grigorescu |PERSON |
|Búcairist |ADDRESS |
|3 Nollaig 2012|DATE |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_mapa|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ga|
|Size:|16.3 MB|
## References
The dataset is available [here](https://huggingface.co/datasets/joelito/mapa).
## Benchmarking
```bash
label precision recall f1-score support
ADDRESS 0.82 0.74 0.78 19
AMOUNT 1.00 1.00 1.00 7
DATE 0.91 0.92 0.91 75
ORGANISATION 0.65 0.67 0.66 48
PERSON 0.71 0.82 0.76 56
micro-avg 0.79 0.82 0.80 205
macro-avg 0.82 0.83 0.82 205
weighted-avg 0.79 0.82 0.80 205
```
---
layout: model
title: English asr_distil_wav2vec2 TFWav2Vec2ForCTC from OthmaneJ
author: John Snow Labs
name: asr_distil_wav2vec2
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_distil_wav2vec2` is an English model originally trained by OthmaneJ.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_distil_wav2vec2_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_distil_wav2vec2_en_4.2.0_3.0_1664020967214.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_distil_wav2vec2_en_4.2.0_3.0_1664020967214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_distil_wav2vec2", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_distil_wav2vec2", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_distil_wav2vec2|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|188.9 MB|
---
layout: model
title: Legal BERT Base Uncased Embedding
author: John Snow Labs
name: bert_base_uncased_legal
date: 2021-09-07
tags: [english, legal, open_source, bert_embeddings, uncased, en]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.2.2
spark_version: 3.0
supported: true
recommended: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
LEGAL-BERT is a family of BERT models for the legal domain, intended to assist legal NLP research, computational law, and legal technology applications. To pre-train the different variations of LEGAL-BERT, we collected 12 GB of diverse English legal text from several fields (e.g., legislation, court cases, contracts) scraped from publicly available resources. Sub-domain variants (CONTRACTS-, EURLEX-, ECHR-) and/or general LEGAL-BERT perform better than using BERT out of the box for domain-specific tasks. A lightweight model (33% the size of BERT-BASE) pre-trained from scratch on legal data with competitive performance is also available.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_uncased_legal_en_3.2.2_3.0_1630999701913.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_uncased_legal_en_3.2.2_3.0_1630999701913.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
embeddings = BertEmbeddings.pretrained("bert_base_uncased_legal", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
```
```scala
val embeddings = BertEmbeddings.pretrained("bert_base_uncased_legal", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.bert.base_uncased_legal").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_base_uncased_legal|
|Compatibility:|Spark NLP 3.2.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Case sensitive:|true|
## Data Source
The model is imported from: https://huggingface.co/nlpaueb/legal-bert-base-uncased
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab240 TFWav2Vec2ForCTC from hassnain
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab240
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab240` is an English model originally trained by hassnain.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab240_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab240_en_4.2.0_3.0_1664023921297.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab240_en_4.2.0_3.0_1664023921297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_timit_demo_colab240", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_timit_demo_colab240", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab240|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|
---
layout: model
title: Turkish Named Entity Recognition (from akdeniz27)
author: John Snow Labs
name: bert_ner_bert_base_turkish_cased_ner
date: 2022-05-09
tags: [bert, ner, token_classification, tr, open_source]
task: Named Entity Recognition
language: tr
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-turkish-cased-ner` is a Turkish model originally trained by `akdeniz27`.
## Predicted Entities
`LOC`, `PER`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_turkish_cased_ner_tr_3.4.2_3.0_1652099217326.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_turkish_cased_ner_tr_3.4.2_3.0_1652099217326.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_turkish_cased_ner","tr") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Spark NLP'yi seviyorum"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_turkish_cased_ner","tr")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Spark NLP'yi seviyorum").toDF("text")
val result = pipeline.fit(data).transform(data)
```
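Downstream, a `NerConverter` stage is typically added to group the IOB-tagged tokens in the `ner` column into entity chunks. Conceptually, that grouping works like the following pure-Python sketch (an illustration of the IOB convention, not the Spark NLP implementation):

```python
def iob_to_chunks(tokens, tags):
    """Group (token, IOB tag) pairs into (entity_text, label) chunks.

    'B-X' starts a chunk with label X, 'I-X' continues it, 'O' closes it.
    """
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # "O", or a stray "I-" without a preceding "B-"
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks
```

For example, tags `["B-PER", "I-PER", "B-LOC", "O", "O"]` over five tokens yield one `PER` chunk spanning the first two tokens and one `LOC` chunk for the third.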
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_bert_base_turkish_cased_ner|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|tr|
|Size:|412.9 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/akdeniz27/bert-base-turkish-cased-ner
- https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt
- https://ieeexplore.ieee.org/document/7495744
---
layout: model
title: Sentence Entity Resolver for ICD10-CM (Augmented)
author: John Snow Labs
name: sbiobertresolve_icd10cm_augmented
date: 2022-01-18
tags: [icd10cm, entity_resolution, clinical, en, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.1
spark_version: 2.4
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to ICD10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It has also been augmented with synonyms to make it more accurate.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_3.3.1_2.4_1642532480732.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_3.3.1_2.4_1642532480732.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(['PROBLEM'])
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
icd10_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver])
data_ner = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."]]).toDF("text")
results = nlpPipeline.fit(data_ner).transform(data_ner)
```
```scala
...
val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
val icd10_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver))
val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.icd10cm.augmented").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.""")
```
## Results
```bash
+-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ner_chunk| entity|icd10cm_code| resolutions| all_codes|
+-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
| gestational diabetes mellitus|PROBLEM| O2441|gestational diabetes mellitus:::postpartum gestational diabetes mel...| O2441:::O2443:::Z8632:::Z875:::O2431:::O2411:::O244:::O241:::O2481|
|subsequent type two diabetes mellitus|PROBLEM| O2411|pre-existing type 2 diabetes mellitus:::disorder associated with ty...|O2411:::E118:::E11:::E139:::E119:::E113:::E1144:::Z863:::Z8639:::E1...|
| T2DM|PROBLEM| E11|type 2 diabetes mellitus:::disorder associated with type 2 diabetes...|E11:::E118:::E119:::O2411:::E109:::E139:::E113:::E8881:::Z833:::D64...|
| HTG-induced pancreatitis|PROBLEM| K8520|alcohol-induced pancreatitis:::drug-induced acute pancreatitis:::he...|K8520:::K853:::K8590:::F102:::K852:::K859:::K8580:::K8591:::K858:::...|
| acute hepatitis|PROBLEM| K720|acute hepatitis:::acute hepatitis a:::acute infectious hepatitis:::...|K720:::B15:::B179:::B172:::Z0389:::B159:::B150:::B16:::K752:::K712:...|
| obesity|PROBLEM| E669|obesity:::abdominal obesity:::obese:::central obesity:::overweight ...|E669:::E668:::Z6841:::Q130:::E66:::E6601:::Z8639:::E349:::H3550:::Z...|
| a body mass index|PROBLEM| Z6841|finding of body mass index:::observation of body mass index:::mass ...|Z6841:::E669:::R229:::Z681:::R223:::R221:::Z68:::R222:::R220:::R418...|
| polyuria|PROBLEM| R35|polyuria:::nocturnal polyuria:::polyuric state:::polyuric state (di...|R35:::R3581:::R358:::E232:::R31:::R350:::R8299:::N401:::E723:::O048...|
| polydipsia|PROBLEM| R631|polydipsia:::psychogenic polydipsia:::primary polydipsia:::psychoge...|R631:::F6389:::E232:::F639:::O40:::G475:::M7989:::R632:::R061:::H53...|
| poor appetite|PROBLEM| R630|poor appetite:::poor feeding:::bad taste in mouth:::unpleasant tast...|R630:::P929:::R438:::R432:::E86:::R196:::F520:::Z724:::R0689:::Z768...|
| vomiting|PROBLEM| R111|vomiting:::intermittent vomiting:::vomiting symptoms:::periodic vom...| R111:::R11:::R1110:::G43A1:::P921:::P9209:::G43A:::R1113:::R110|
| a respiratory tract infection|PROBLEM| J988|respiratory tract infection:::upper respiratory tract infection:::b...|J988:::J069:::A499:::J22:::J209:::Z593:::T17:::J0410:::Z1383:::J189...|
+-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
```
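Under the hood, `setDistanceFunction("EUCLIDEAN")` makes the resolver rank, for each chunk embedding, the codes whose stored embeddings lie nearest in Euclidean distance. A toy pure-Python sketch of that lookup (the 2-D vectors and the code list are made up for illustration; real sentence embeddings have hundreds of dimensions):

```python
import math

def resolve(query, code_embeddings, k=3):
    """Return the k codes whose embeddings are closest to `query` (Euclidean)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(code_embeddings, key=lambda item: dist(query, item[1]))
    return [code for code, _ in ranked[:k]]

# Made-up example: three ICD10-CM codes embedded in 2-D.
codes = [("E669", [1.0, 0.0]), ("R35", [0.0, 1.0]), ("R111", [0.9, 0.1])]
```

Calling `resolve([1.0, 0.05], codes, k=2)` ranks `E669` first and `R111` second, mirroring how the `all_codes` column lists candidates in increasing distance order.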
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_icd10cm_augmented|
|Compatibility:|Healthcare NLP 3.3.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[icd10cm_code]|
|Language:|en|
|Size:|1.4 GB|
|Case sensitive:|false|
|Dependencies:|embeddings_clinical|
## Data Source
Trained on ICD10CM 2022 Codes dataset: https://www.cdc.gov/nchs/icd/icd10cm.htm
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265911
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265911` is an English model originally trained by `teacookies`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265911_en_4.0.0_3.0_1655985889976.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265911_en_4.0.0_3.0_1655985889976.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265911","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265911","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265911").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
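Extractive QA models such as this one score every context token as a potential answer start and answer end; the predicted answer is the span that maximizes the combined score subject to start ≤ end. A small pure-Python sketch of that decoding step (the scores below are illustrative, not model outputs):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (start, end) maximizing start_scores[i] + end_scores[j], i <= j."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best, best_score = (i, j), s + end_scores[j]
    return best

# tokens: ["My", "name", "is", "Clara"] -- "Clara" should win both positions.
# best_span([0.1, 0.2, 0.1, 0.9], [0.1, 0.1, 0.2, 0.8]) -> (3, 3)
```

The `max_len` cap mirrors the common practice of rejecting implausibly long answer spans.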
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265911|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|888.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265911
---
layout: model
title: Legal Return Of Confidential Information Clause Binary Classifier
author: John Snow Labs
name: legclf_return_of_conf_info_clause
date: 2023-02-13
tags: [en, legal, classification, return, confidential, information, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `return_of_conf_info` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
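The 512-token limit mentioned above can be handled with a simple windowing pass before classification. A pure-Python sketch (whitespace tokenization stands in for the model's real subword tokenizer, which may produce more tokens per word, so a safety margin below 512 is advisable):

```python
def split_into_windows(text, max_tokens=512):
    """Split whitespace-tokenized text into pieces of at most max_tokens tokens."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

Each resulting piece can then be classified independently, and a document flagged as containing the clause if any window is predicted `return_of_conf_info`.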
## Predicted Entities
`return_of_conf_info`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_return_of_conf_info_clause_en_1.0.0_3.0_1676304098427.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_return_of_conf_info_clause_en_1.0.0_3.0_1676304098427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------------------------------------+
|result                               |
+-------------------------------------+
|[limited-liability-company-agreement]|
|[other]                              |
|[other]                              |
|[limited-liability-company-agreement]|
+-------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_limited_liability_company_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house + SEC documents
## Benchmarking
```bash
label precision recall f1-score support
limited-liability-company-agreement 0.98 0.98 0.98 121
other 0.99 0.99 0.99 204
accuracy - - 0.98 325
macro-avg 0.98 0.98 0.98 325
weighted-avg 0.98 0.98 0.98 325
```
---
layout: model
title: NER Pipeline for 10 High Resourced Languages
author: John Snow Labs
name: xlm_roberta_large_token_classifier_hrl_pipeline
date: 2022-06-27
tags: [arabic, german, english, spanish, french, italian, latvian, dutch, portuguese, chinese, xlm, roberta, ner, xx, open_source]
task: Named Entity Recognition
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [xlm_roberta_large_token_classifier_hrl](https://nlp.johnsnowlabs.com/2021/12/26/xlm_roberta_large_token_classifier_hrl_xx.html) model.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_HRL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/Ner_HRL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_hrl_pipeline_xx_4.0.0_3.0_1656371823877.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_hrl_pipeline_xx_4.0.0_3.0_1656371823877.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("xlm_roberta_large_token_classifier_hrl_pipeline", lang = "xx")
pipeline.annotate("يمكنكم مشاهدة أمير منطقة الرياض الأمير فيصل بن بندر بن عبد العزيز في كل مناسبة وافتتاح تتعلق بمشاريع التعليم والصحة وخدمة الطرق والمشاريع الثقافية في منطقة الرياض.")
```
```scala
val pipeline = new PretrainedPipeline("xlm_roberta_large_token_classifier_hrl_pipeline", lang = "xx")
pipeline.annotate("يمكنكم مشاهدة أمير منطقة الرياض الأمير فيصل بن بندر بن عبد العزيز في كل مناسبة وافتتاح تتعلق بمشاريع التعليم والصحة وخدمة الطرق والمشاريع الثقافية في منطقة الرياض.")
```
## Results
```bash
+---------------------------+---------+
|chunk |ner_label|
+---------------------------+---------+
|الرياض |LOC |
|فيصل بن بندر بن عبد العزيز |PER |
|الرياض |LOC |
+---------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_large_token_classifier_hrl_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|xx|
|Size:|1.8 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- XlmRoBertaForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: Explain Document Pipeline for Finnish
author: John Snow Labs
name: explain_document_sm
date: 2021-03-22
tags: [open_source, finnish, explain_document_sm, pipeline, fi]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: fi
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The explain_document_sm is a simple pretrained pipeline that performs basic processing steps.
It performs most of the common text processing tasks on your dataframe.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_fi_3.0.0_3.0_1616429037499.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_sm_fi_3.0.0_3.0_1616429037499.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('explain_document_sm', lang = 'fi')
annotations = pipeline.fullAnnotate("Hei John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_sm", lang = "fi")
val result = pipeline.fullAnnotate("Hei John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hei John Snow Labs! "]
result_df = nlu.load('fi.explain').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | lemma | pos | embeddings | ner | entities |
|---:|:-------------------------|:------------------------|:---------------------------------|:---------------------------------|:------------------------------------|:-----------------------------|:---------------------------------|:--------------------|
| 0 | ['Hei John Snow Labs! '] | ['Hei John Snow Labs!'] | ['Hei', 'John', 'Snow', 'Labs!'] | ['hei', 'John', 'Snow', 'Labs!'] | ['INTJ', 'PROPN', 'PROPN', 'PROPN'] | [[-0.394499987363815,.,...]] | ['O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_document_sm|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|fi|
---
layout: model
title: Sentence Entity Resolver for RxNorm (sbiobert_base_cased_mli embeddings)
author: John Snow Labs
name: sbiobertresolve_rxnorm_augmented
date: 2022-01-03
tags: [rxnorm, licensed, en, clinical, entity_resolution]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.1
spark_version: 2.4
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It was trained on the augmented version of the dataset used in previous RxNorm resolver models. Additionally, this model returns the concept classes of the drugs in the `all_k_aux_labels` column.
## Predicted Entities
`RxNorm Codes`, `Concept Classes`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_augmented_en_3.3.1_2.4_1641241820334.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_augmented_en_3.3.1_2.4_1641241820334.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The ```sbiobertresolve_rxnorm_augmented``` resolver model must be used with ```sbiobert_base_cased_mli``` as embeddings and ```ner_posology``` as the NER model, with ```DRUG``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli", "en","clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("sbert_embeddings")
rxnorm_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")\
.setInputCols("sbert_embeddings")\
.setOutputCol("rxnorm_code")\
.setDistanceFunction("EUCLIDEAN")
rxnorm_pipeline = Pipeline(stages = [
documentAssembler,
sbert_embedder,
rxnorm_resolver])
model = rxnorm_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_model = LightPipeline(model)
result = light_model.fullAnnotate(["Coumadin 5 mg", "aspirin", "avandia 4 mg"])
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli", "en","clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("sbert_embeddings")
val rxnorm_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")
.setInputCols("sbert_embeddings")
.setOutputCol("rxnorm_code")
.setDistanceFunction("EUCLIDEAN")
val rxnorm_pipelineModel = new Pipeline().setStages(Array(documentAssembler,
sbert_embedder,
rxnorm_resolver))
val data = Seq("Coumadin 5 mg", "aspirin", "avandia 4 mg").toDF("text")
val result= rxnorm_pipelineModel.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.rxnorm_augmented").predict("""Coumadin 5 mg""")
```
## Results
```bash
| | RxNormCode | Resolution | all_k_results | all_k_distances | all_k_cosine_distances | all_k_resolutions | all_k_aux_labels |
|---:|-------------:|:-----------------------------------------|:----------------------------------|:----------------------------------|:----------------------------------|:----------------------------------------------------------------|:----------------------------------|
| 0 | 855333 | warfarin sodium 5 MG [Coumadin] | 855333:::432467:::438740:::103... | 3.0367:::4.7790:::4.7790:::5.3... | 0.0161:::0.0395:::0.0395:::0.0... | warfarin sodium 5 MG [Coumadin]:::coumarin 5 MG Oral Tablet:... | Branded Drug Comp:::Clinical D... |
| 1 | 1537020 | aspirin Effervescent Oral Tablet | 1537020:::1191:::1295740:::405... | 0.0000:::0.0000:::4.1826:::5.7... | 0.0000:::0.0000:::0.0292:::0.0... | aspirin Effervescent Oral Tablet:::aspirin:::aspirin Oral Po... | Clinical Drug Form:::Ingredien... |
| 2 | 261242 | rosiglitazone 4 MG Oral Tablet [Avandia] | 261242:::810073:::153845:::109... | 0.0000:::4.7482:::5.0125:::5.2... | 0.0000:::0.0365:::0.0409:::0.0... | rosiglitazone 4 MG Oral Tablet [Avandia]:::fesoterodine fuma... | Branded Drug:::Branded Drug Co... |
```
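The `all_k_*` columns above are single strings with candidates joined by `:::`. Splitting them into ranked `(code, distance)` pairs can be sketched as follows (the helper name is illustrative):

```python
def parse_topk(all_k_results, all_k_distances, k=3):
    """Split ':::'-joined candidate codes and distances into ranked pairs."""
    codes = all_k_results.split(":::")
    distances = [float(d) for d in all_k_distances.split(":::")]
    return list(zip(codes, distances))[:k]

parse_topk("855333:::432467:::438740", "3.0367:::4.7790:::4.7790", k=2)
```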
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_rxnorm_augmented|
|Compatibility:|Healthcare NLP 3.3.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[rxnorm_code]|
|Language:|en|
|Size:|976.1 MB|
|Case sensitive:|false|
---
layout: model
title: Fast Neural Machine Translation Model from Basque to English
author: John Snow Labs
name: opus_mt_eu_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, eu, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `eu`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_eu_en_xx_2.7.0_2.4_1609166590644.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_eu_en_xx_2.7.0_2.4_1609166590644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_eu_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["Put your Basque text to translate here."]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_eu_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Put your Basque text to translate here.").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.eu.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_eu_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Translate English to French Pipeline
author: John Snow Labs
name: translate_en_fr
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, fr, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `fr`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_fr_xx_2.7.0_2.4_1609684801803.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_fr_xx_2.7.0_2.4_1609684801803.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_fr", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_fr", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.fr').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_fr|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Word2Vec Embeddings in Romanian (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, ro, open_source]
task: Embeddings
language: ro
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ro_3.4.1_3.0_1647454014729.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ro_3.4.1_3.0_1647454014729.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ro") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Îmi place Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ro")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Îmi place Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ro.embed.w2v_cc_300d").predict("""Îmi place Spark NLP""")
```
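Downstream, 300-dimensional word vectors like these are typically compared with cosine similarity; a minimal, dependency-free sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cosine([1.0, 0.0], [1.0, 0.0])  # identical directions give 1.0
```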
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|ro|
|Size:|1.2 GB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: English RobertaForQuestionAnswering (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739549310.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739549310.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.base_rule_based_hier_triplet_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
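The NLU one-liner packs the question and context into a single string separated by `|||`; unpacking it can be sketched as:

```python
def split_question_context(packed):
    """Split a 'question|||context' string into its two parts."""
    question, _, context = packed.partition("|||")
    return question, context

split_question_context("What's my name?|||My name is Clara and I live in Berkeley.")
```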
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|460.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0
---
layout: model
title: Part of Speech for Korean
author: John Snow Labs
name: pos_ud_kaist
date: 2021-03-09
tags: [part_of_speech, open_source, korean, pos_ud_kaist, ko]
task: Part of Speech Tagging
language: ko
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`.
## Predicted Entities
- CCONJ
- ADV
- SCONJ
- DET
- NOUN
- VERB
- ADJ
- PUNCT
- AUX
- PRON
- PROPN
- NUM
- INTJ
- PART
- X
- ADP
- SYM
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_kaist_ko_3.0.0_3.0_1615292391244.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_kaist_ko_3.0.0_3.0_1615292391244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ud_kaist", "ko") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['John Snow Labs에서 안녕하세요! ']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_kaist", "ko")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("John Snow Labs에서 안녕하세요! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["John Snow Labs에서 안녕하세요! "]
token_df = nlu.load('ko.pos.ud_kaist').predict(text)
token_df
```
## Results
```bash
token pos
0 J NOUN
1 o NOUN
2 h NOUN
3 n SCONJ
4 S X
5 n X
6 o X
7 w X
8 L X
9 a X
10 b X
11 s X
12 에 ADP
13 서 SCONJ
14 안 ADV
15 녕 VERB
16 하세요 VERB
17 ! PUNCT
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_kaist|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|ko|
---
layout: model
title: Multilingual XLMRobertaForTokenClassification Base Cased model (from cj-mills)
author: John Snow Labs
name: xlmroberta_ner_cj_mills_base_finetuned_panx_all
date: 2022-08-13
tags: [xx, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: xx
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `cj-mills`.
## Predicted Entities
`ORG`, `LOC`, `PER`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_cj_mills_base_finetuned_panx_all_xx_4.1.0_3.0_1660427899930.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_cj_mills_base_finetuned_panx_all_xx_4.1.0_3.0_1660427899930.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_cj_mills_base_finetuned_panx_all","xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_cj_mills_base_finetuned_panx_all","xx")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
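`NerConverter` groups the token-level IOB tags emitted by the classifier into entity chunks. A minimal pure-Python sketch of that grouping (illustrative, not the Spark NLP implementation):

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, label) chunks."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

iob_to_chunks(["John", "lives", "in", "Paris"], ["B-PER", "O", "O", "B-LOC"])
# → [("John", "PER"), ("Paris", "LOC")]
```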
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_cj_mills_base_finetuned_panx_all|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|xx|
|Size:|860.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/cj-mills/xlm-roberta-base-finetuned-panx-all
---
layout: model
title: Pipeline to Detect PHI for Generic Deidentification in Romanian (BERT)
author: John Snow Labs
name: ner_deid_generic_bert_pipeline
date: 2023-03-09
tags: [licensed, clinical, ro, deidentification, phi, generic, bert]
task: Named Entity Recognition
language: ro
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_deid_generic_bert](https://nlp.johnsnowlabs.com/2022/11/22/ner_deid_generic_bert_ro.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_pipeline_ro_4.3.0_3.2_1678352946195.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_pipeline_ro_4.3.0_3.2_1678352946195.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_deid_generic_bert_pipeline", "ro", "clinical/models")
text = '''Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_deid_generic_bert_pipeline", "ro", "clinical/models")
val text = "Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:-----------------------------|--------:|------:|:------------|-------------:|
| 0 | Spitalul Pentru Ochi de Deal | 0 | 27 | LOCATION | 0.99352 |
| 1 | Drumul Oprea Nr. 972 | 30 | 49 | LOCATION | 0.99994 |
| 2 | Vaslui | 51 | 56 | LOCATION | 1 |
| 3 | 737405 | 59 | 64 | LOCATION | 1 |
| 4 | +40(235)413773 | 79 | 92 | CONTACT | 1 |
| 5 | 25 May 2022 | 119 | 129 | DATE | 1 |
| 6 | si | 145 | 146 | NAME | 0.9998 |
| 7 | BUREAN MARIA | 158 | 169 | NAME | 0.9993 |
| 8 | 77 | 180 | 181 | AGE | 1 |
|  9 | Agota Evelyn Tımar C         |   191 |   210 | NAME        |     0.859975 |
| 10 | 2450502264401 | 218 | 230 | ID | 1 |
```
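De-identification ultimately replaces each detected chunk with its label, using the inclusive `begin`/`end` offsets shown above; a minimal sketch (the helper is illustrative, not the Healthcare NLP API):

```python
def mask_phi(text, chunks):
    """Replace each (chunk_text, begin, end, label) span with '<LABEL>', right to left
    so earlier offsets stay valid."""
    masked = text
    for _, begin, end, label in sorted(chunks, key=lambda c: c[1], reverse=True):
        masked = masked[:begin] + f"<{label}>" + masked[end + 1:]
    return masked

mask_phi("Varsta: 77", [("77", 8, 9, "AGE")])  # "Varsta: <AGE>"
```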
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_generic_bert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|ro|
|Size:|483.8 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265905
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265905` is an English model originally trained by `teacookies`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265905_en_4.0.0_3.0_1655985226687.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265905_en_4.0.0_3.0_1655985226687.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265905","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265905","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265905").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265905|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|888.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265905
---
layout: model
title: Fast Neural Machine Translation Model from English to Afro-Asiatic Languages
author: John Snow Labs
name: opus_mt_en_afa
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, afa, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `afa`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_afa_xx_2.7.0_2.4_1609169665906.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_afa_xx_2.7.0_2.4_1609169665906.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_afa", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["Put your English text to translate here."]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_afa", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Put your English text to translate here.").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.afa').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_afa|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Sentence Entity Resolver for RxNorm (sbiobert_base_cased_mli embeddings)
author: John Snow Labs
name: sbiobertresolve_rxnorm
date: 2021-05-16
tags: [entity_resolution, clinical, licensed, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.4
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes using `sbiobert_base_cased_mli` Sentence BERT embeddings. It loads about 6x faster than previous versions, and the load process is more memory-friendly: the peak memory required during loading is smaller, reducing the chance of OOM exceptions and relaxing hardware requirements.
## Predicted Entities
Predicts RxNorm Codes and their normalized definition for each chunk.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_RXNORM/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_en_3.0.4_3.0_1636395903630.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_en_3.0.4_3.0_1636395903630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
chunk2doc = Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm","en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, rxnorm_resolver])
data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val chunk2doc = new Chunk2Doc()
.setInputCols("ner_chunk")
.setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols("ner_chunk_doc")
.setOutputCol("sbert_embeddings")
val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm","en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, rxnorm_resolver))
val data = Seq("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret's Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.rxnorm").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""")
```
## Results
```bash
+--------------------+-----+---+---------+-------+----------+-----------------------------------------------+--------------------+
| chunk|begin|end| entity| code|confidence| resolutions| codes|
+--------------------+-----+---+---------+-------+----------+-----------------------------------------------+--------------------+
| hypertension| 68| 79| PROBLEM| 386165| 0.1567|hypercal:::hypersed:::hypertears:::hyperstat...|386165:::217667::...|
|chronic renal ins...| 83|109| PROBLEM| 218689| 0.1036|nephro calci:::dialysis solutions:::creatini...|218689:::3310:::2...|
| COPD| 113|116| PROBLEM|1539999| 0.1644|broncomar dm:::acne medication:::carbon mono...|1539999:::214981:...|
| gastritis| 120|128| PROBLEM| 225965| 0.1983|gastroflux:::gastroflux oral product:::uceri...|225965:::1176661:...|
| TIA| 136|138| PROBLEM|1089812| 0.0625|thera tears:::thiotepa injection:::nature's ...|1089812:::1660003...|
|a non-ST elevatio...| 182|202| PROBLEM| 218767| 0.1007|non-aspirin pm:::aspirin-free:::non aspirin ...|218767:::215440::...|
|Guaiac positive s...| 208|229| PROBLEM|1294361| 0.0820|anusol rectal product:::anusol hc rectal pro...|1294361:::1166715...|
|cardiac catheteri...| 295|317| TEST| 385247| 0.1566|cardiacap:::cardiology pack:::cardizem:::car...|385247:::545063::...|
| PTCA| 324|327|TREATMENT| 8410| 0.0867|alteplase:::reteplase:::pancuronium:::tripe ...|8410:::76895:::78...|
| mid LAD lesion| 332|345| PROBLEM| 151672| 0.0549|dulcolax:::lazerformalyde:::linaclotide:::du...|151672:::217985::...|
+--------------------+-----+---+---------+-------+----------+-----------------------------------------------+--------------------+
```
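The `resolutions` and `codes` columns above hold `:::`-separated candidate lists, ranked by similarity, and the two lists are index-aligned. As a minimal post-processing sketch (assuming the Spark result has already been collected to plain Python strings; the two sample values are taken from the first row of the table):

```python
# Pair up the ":::"-separated candidate codes and terms from one result row.
# Values below are from the "hypertension" row of the table above (truncated
# in the table, so only the first two candidates are shown here).
resolutions = "hypercal:::hypersed"
codes = "386165:::217667"

ranked = list(zip(codes.split(":::"), resolutions.split(":::")))
top_code, top_term = ranked[0]  # highest-ranked RxNorm candidate
```

The first element of `ranked` corresponds to the `code` column of the table.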
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_rxnorm|
|Compatibility:|Healthcare NLP 3.0.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk, drugs_sbert_embeddings]|
|Output Labels:|[rxnorm_code]|
|Language:|en|
|Case sensitive:|false|
## Data Source
Trained on November 2020 RxNorm Clinical Drugs ontology graph with ``sbiobert_base_cased_mli`` embeddings.
https://www.nlm.nih.gov/pubs/techbull/nd20/brief/nd20_rx_norm_november_release.html
---
layout: model
title: Stopwords Remover for Marathi language (187 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, mr, open_source]
task: Stop Words Removal
language: mr
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_mr_3.4.1_3.0_1646672300971.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_mr_3.4.1_3.0_1646672300971.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","mr") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["आपण माझ्यापेक्षा चांगले नाही"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","mr")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("आपण माझ्यापेक्षा चांगले नाही").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("mr.stopwords").predict("""आपण माझ्यापेक्षा चांगले नाही""")
```
## Results
```bash
+----------------------------+
|result |
+----------------------------+
|[माझ्यापेक्षा, चांगले, नाही]|
+----------------------------+
```
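The cleaner simply drops every token that appears in the 187-entry Marathi stopword list, which is why "आपण" is missing from the result above. A minimal plain-Python sketch of the same filtering logic, using a tiny hypothetical subset of the real list:

```python
# Hypothetical one-entry subset of the 187-entry Marathi stopword list,
# just to illustrate the filtering the StopWordsCleaner performs.
stopwords = {"आपण"}
tokens = ["आपण", "माझ्यापेक्षा", "चांगले", "नाही"]

clean_tokens = [t for t in tokens if t not in stopwords]
```

`clean_tokens` matches the `result` column shown in the table above.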
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|mr|
|Size:|2.1 KB|
---
layout: model
title: English T5ForConditionalGeneration Cased model (from ThomasNLG)
author: John Snow Labs
name: t5_qg_webnlg_synth
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-qg_webnlg_synth-en` is an English model originally trained by `ThomasNLG`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_qg_webnlg_synth_en_4.3.0_3.0_1675125600977.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_qg_webnlg_synth_en_4.3.0_3.0_1675125600977.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_qg_webnlg_synth","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_qg_webnlg_synth","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_qg_webnlg_synth|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|280.8 MB|
## References
- https://huggingface.co/ThomasNLG/t5-qg_webnlg_synth-en
- https://github.com/ThomasScialom/QuestEval
- https://arxiv.org/abs/2104.07555
---
layout: model
title: English asr_wav2vec2_xlsr_53_phon TFWav2Vec2ForCTC from facebook
author: John Snow Labs
name: pipeline_asr_wav2vec2_xlsr_53_phon
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_53_phon` is an English model originally trained by facebook.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xlsr_53_phon_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_53_phon_en_4.2.0_3.0_1664109509538.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_53_phon_en_4.2.0_3.0_1664109509538.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_53_phon', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_53_phon", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xlsr_53_phon|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|756.9 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Hindi asr_Wav2Vec2_xls_r_lm_300m TFWav2Vec2ForCTC from LegolasTheElf
author: John Snow Labs
name: asr_Wav2Vec2_xls_r_lm_300m
date: 2022-09-26
tags: [wav2vec2, hi, audio, open_source, asr]
task: Automatic Speech Recognition
language: hi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Wav2Vec2_xls_r_lm_300m` is a Hindi model originally trained by LegolasTheElf.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_Wav2Vec2_xls_r_lm_300m_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Wav2Vec2_xls_r_lm_300m_hi_4.2.0_3.0_1664190519147.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Wav2Vec2_xls_r_lm_300m_hi_4.2.0_3.0_1664190519147.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_Wav2Vec2_xls_r_lm_300m", "hi")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_Wav2Vec2_xls_r_lm_300m", "hi")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_Wav2Vec2_xls_r_lm_300m|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|hi|
|Size:|1.2 GB|
---
layout: model
title: Catalan RobertaForQuestionAnswering (from projecte-aina)
author: John Snow Labs
name: roberta_qa_roberta_base_ca_cased_qa
date: 2022-06-20
tags: [ca, open_source, question_answering, roberta]
task: Question Answering
language: ca
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-ca-cased-qa` is a Catalan model originally trained by `projecte-aina`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_ca_cased_qa_ca_4.0.0_3.0_1655730281795.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_ca_cased_qa_ca_4.0.0_3.0_1655730281795.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_ca_cased_qa","ca") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_ca_cased_qa","ca")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_ca_cased_qa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|ca|
|Size:|451.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/projecte-aina/roberta-base-ca-cased-qa
- https://arxiv.org/abs/1907.11692
- https://github.com/projecte-aina/club
---
layout: model
title: Relation extraction between Drugs and ADE (ReDL)
author: John Snow Labs
name: redl_ade_biobert
date: 2021-07-12
tags: [relation_extraction, en, clinical, licensed, ade, biobert]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 3.1.2
spark_version: 3.0
supported: true
annotator: RelationExtractionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is an end-to-end trained BioBERT model that relates drugs to the adverse reactions they cause. It predicts whether an adverse event is caused by a drug: `1` indicates that the adverse event and drug entities are related, `0` indicates that they are not.
## Predicted Entities
`0`, `1`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ADE/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/RE_ADE.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_ade_biobert_en_3.1.2_3.0_1626105541347.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_ade_biobert_en_3.1.2_3.0_1626105541347.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = sparknlp.annotators.Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
words_embedder = WordEmbeddingsModel() \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"]) \
.setOutputCol("embeddings")
ner_tagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_converter = NerConverter() \
.setInputCols(["sentences", "tokens", "ner_tags"]) \
.setOutputCol("ner_chunks")
pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
dependency_parser = sparknlp.annotators.DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentences", "pos_tags", "tokens"])\
.setOutputCol("dependencies")
# Set a filter on pairs of named entities which will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter() \
.setInputCols(["ner_chunks", "dependencies"])\
.setMaxSyntacticDistance(10)\
.setOutputCol("re_ner_chunks")\
.setRelationPairs(['ade-drug', 'drug-ade'])
# The dataset this model was trained on is sentence-level.
# The model can also be trained on document-level relations; in that case, use "document" instead of "sentence" as input when predicting.
re_model = RelationExtractionDLModel()\
.pretrained('redl_ade_biobert', 'en', "clinical/models") \
.setPredictionThreshold(0.5)\
.setInputCols(["re_ner_chunks", "sentences"]) \
.setOutputCol("relations")
pipeline = Pipeline(stages=[documenter,
sentencer,
tokenizer,
pos_tagger,
words_embedder,
ner_tagger,
ner_converter,
dependency_parser,
re_ner_chunk_filter,
re_model])
light_pipeline = LightPipeline(pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
text ="""Been taking Lipitor for 15 years , have experienced severe fatigue a lot. The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps."""
annotations = light_pipeline.fullAnnotate(text)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val words_embedder = WordEmbeddingsModel()
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val ner_tagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_converter = new NerConverter()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val pos_tagger = PerceptronModel()
.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val dependency_parser = DependencyParserModel()
.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
// Set a filter on pairs of named entities which will be treated as relation candidates
val re_ner_chunk_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunks", "dependencies"))
.setMaxSyntacticDistance(10)
.setOutputCol("re_ner_chunks")
.setRelationPairs(Array("drug-ade", "ade-drug"))
// The dataset this model was trained on is sentence-level.
// The model can also be trained on document-level relations; in that case, use "document" instead of "sentence" as input when predicting.
val re_model = RelationExtractionDLModel()
.pretrained("redl_ade_biobert", "en", "clinical/models")
.setPredictionThreshold(0.5)
.setInputCols(Array("re_ner_chunks", "sentences"))
.setOutputCol("relations")
val pipeline = new Pipeline().setStages(Array(documenter,
sentencer,
tokenizer,
words_embedder,
ner_tagger,
ner_converter,
pos_tagger,
dependency_parser,
re_ner_chunk_filter,
re_model))
val data = Seq("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot. The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.adverse_drug_events.clinical.biobert").predict("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""")
```
## Results
```bash
| relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|-----------:|:----------|----------------:|--------------:|:----------|:----------|----------------:|--------------:|:---------------|-------------:|
| 1 | DRUG | 12 | 18 | Lipitor | ADE | 52 | 65 | severe fatigue | 0.998156 |
| 1 | DRUG | 97 | 105 | voltarene | ADE | 144 | 156 | muscle cramps | 0.985513 |
```
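The model emits one row per candidate entity pair, and `setPredictionThreshold(0.5)` drops predictions whose confidence falls below the threshold. An illustrative plain-Python sketch of that filtering step, using the two rows from the result table above plus one hypothetical low-confidence pair:

```python
# Rows mirror the result table above; the third row is a hypothetical
# low-confidence candidate added to illustrate the threshold.
rows = [
    {"relation": "1", "chunk1": "Lipitor",   "chunk2": "severe fatigue", "confidence": 0.998156},
    {"relation": "1", "chunk1": "voltarene", "chunk2": "muscle cramps",  "confidence": 0.985513},
    {"relation": "0", "chunk1": "Lipitor",   "chunk2": "muscle cramps",  "confidence": 0.41},  # hypothetical
]

threshold = 0.5  # the value passed to setPredictionThreshold
kept = [r for r in rows if r["confidence"] >= threshold]
```

Only the two high-confidence pairs survive, matching the result table.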
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|redl_ade_biobert|
|Compatibility:|Healthcare NLP 3.1.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[redl_ner_chunks, document]|
|Output Labels:|[relations]|
|Language:|en|
## Data Source
This model is trained on custom data annotated by JSL.
## Benchmarking
```bash
label Recall Precision F1 Support
0 0.829 0.895 0.861 1146
1 0.955 0.923 0.939 2454
Avg. 0.892 0.909 0.900 -
Weighted-Avg. 0.915 0.914 0.914 -
```
---
layout: model
title: German asr_wav2vec2_large_xlsr_53_german_by_marcel TFWav2Vec2ForCTC from marcel
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_53_german_by_marcel
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_german_by_marcel` is a German model originally trained by marcel.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_german_by_marcel_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_by_marcel_de_4.2.0_3.0_1664101884066.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_by_marcel_de_4.2.0_3.0_1664101884066.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xlsr_53_german_by_marcel", "de")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xlsr_53_german_by_marcel", "de")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xlsr_53_german_by_marcel|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|de|
|Size:|1.2 GB|
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_0
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_0_en_4.0.0_3.0_1657183896456.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_0_en_4.0.0_3.0_1657183896456.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-0
---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-32-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_2_en_4.0.0_3.0_1654191544720.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_2_en_4.0.0_3.0_1654191544720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.span_bert.base_cased_32d_seed_2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|376.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-32-finetuned-squad-seed-2
---
layout: model
title: English BertForQuestionAnswering Tiny Cased model (from mrm8488)
author: John Snow Labs
name: bert_qa_tiny_wrslb_finetuned_squadv1
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-wrslb-finetuned-squadv1` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_tiny_wrslb_finetuned_squadv1_en_4.0.0_3.0_1657188696480.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_tiny_wrslb_finetuned_squadv1_en_4.0.0_3.0_1657188696480.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tiny_wrslb_finetuned_squadv1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tiny_wrslb_finetuned_squadv1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_tiny_wrslb_finetuned_squadv1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|16.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/mrm8488/bert-tiny-wrslb-finetuned-squadv1
---
layout: model
title: Fast Neural Machine Translation Model from English to Hindi
author: John Snow Labs
name: opus_mt_en_hi
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, hi, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `en`
- target languages: `hi`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_hi_xx_2.7.0_2.4_1609169231360.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_hi_xx_2.7.0_2.4_1609169231360.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_hi", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_hi", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.hi').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_hi|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Russian Named Entity Recognition (from IlyaGusev)
author: John Snow Labs
name: bert_ner_rubertconv_toxic_editor
date: 2022-05-09
tags: [bert, ner, token_classification, ru, open_source]
task: Named Entity Recognition
language: ru
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `rubertconv_toxic_editor` is a Russian model originally trained by `IlyaGusev`.
## Predicted Entities
`equal`, `replace`, `delete`, `insert`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_rubertconv_toxic_editor_ru_3.4.2_3.0_1652099038495.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_rubertconv_toxic_editor_ru_3.4.2_3.0_1652099038495.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_rubertconv_toxic_editor","ru") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Я люблю Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_rubertconv_toxic_editor","ru")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Я люблю Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_rubertconv_toxic_editor|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|ru|
|Size:|662.8 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/IlyaGusev/rubertconv_toxic_editor
- https://colab.research.google.com/drive/1NUSO1QGlDgD-IWXa2SpeND089eVxrCJW
- https://github.com/skoltech-nlp/russe_detox_2022/tree/main/data
---
layout: model
title: Aspect based Sentiment Analysis for restaurant reviews
author: John Snow Labs
name: ner_aspect_based_sentiment
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Automatically detect positive, negative and neutral aspects about restaurants from user reviews. Instead of labelling the entire review as negative or positive, this model helps identify which exact phrases relate to sentiment identified in the review.
## Predicted Entities
`NEG`, `POS`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/ASPECT_BASED_SENTIMENT_RESTAURANT/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_aspect_based_sentiment_en_3.0.0_3.0_1617209723737.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_aspect_based_sentiment_en_3.0.0_3.0_1617209723737.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", "xx")\
.setInputCols(["document", "token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_aspect_based_sentiment")\
.setInputCols(["document", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("entities")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter])
model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["Came for lunch my sister. We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. But the service was below average and the chips were too terrible to finish."]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", "xx")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_aspect_based_sentiment")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("entities")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter))
val data = Seq("Came for lunch my sister. We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. But the service was below average and the chips were too terrible to finish.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.aspect_sentiment").predict("""Came for lunch my sister. We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. But the service was below average and the chips were too terrible to finish.""")
```
## Results
```bash
+----------------------------------------------------------------------------------------------------+-------------------+-----------+
| sentence | aspect | sentiment |
+----------------------------------------------------------------------------------------------------+-------------------+-----------+
| We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. | Thai-style main | positive |
| We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. | lots of flavours | positive |
| But the service was below average and the chips were too terrible to finish. | service | negative |
| But the service was below average and the chips were too terrible to finish. | chips | negative |
+----------------------------------------------------------------------------------------------------+-------------------+-----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_aspect_based_sentiment|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[token, embeddings]|
|Output Labels:|[absa]|
|Language:|en|
---
layout: model
title: English asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6 TFWav2Vec2ForCTC from chrisvinsen
author: John Snow Labs
name: asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6` is an English model originally trained by chrisvinsen.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6_en_4.2.0_3.0_1664106725822.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6_en_4.2.0_3.0_1664106725822.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Detect posology entities (biobert)
author: John Snow Labs
name: ner_posology_biobert
date: 2021-04-01
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Detect drug, dosage, and administration instructions in text using a pretrained NER model.
## Predicted Entities
`FREQUENCY`, `DRUG`, `STRENGTH`, `FORM`, `DURATION`, `DOSAGE`, `ROUTE`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_biobert_en_3.0.0_3.0_1617260806766.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_biobert_en_3.0.0_3.0_1617260806766.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_posology_biobert", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_posology_biobert", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))
val data = Seq("EXAMPLE_TEXT").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.posology.biobert").predict("""Put your text here.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("xlm_roberta_base_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")
```
```scala
val pipeline = new PretrainedPipeline("xlm_roberta_base_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|John |PERSON |
|John Snow Labs|ORG |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_base_token_classifier_conll03_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Community|
|Language:|en|
|Size:|851.9 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- XlmRoBertaForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: English asr_wav2vec2_large_xlsr_ksponspeech_1_20 TFWav2Vec2ForCTC from cheulyop
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_ksponspeech_1_20
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_ksponspeech_1_20` is an English model originally trained by cheulyop.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_ksponspeech_1_20_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_ksponspeech_1_20_en_4.2.0_3.0_1664097388003.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_ksponspeech_1_20_en_4.2.0_3.0_1664097388003.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xlsr_ksponspeech_1_20", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xlsr_ksponspeech_1_20", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xlsr_ksponspeech_1_20|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: German Electra Embeddings (from stefan-it)
author: John Snow Labs
name: electra_embeddings_electra_base_gc4_64k_400000_cased_generator
date: 2022-05-17
tags: [de, open_source, electra, embeddings]
task: Embeddings
language: de
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-gc4-64k-400000-cased-generator` is a German model originally trained by `stefan-it`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_400000_cased_generator_de_3.4.4_3.0_1652786393218.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_400000_cased_generator_de_3.4.4_3.0_1652786393218.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_400000_cased_generator","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_400000_cased_generator","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ich liebe Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_embeddings_electra_base_gc4_64k_400000_cased_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|de|
|Size:|223.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/stefan-it/electra-base-gc4-64k-400000-cased-generator
- https://german-nlp-group.github.io/projects/gc4-corpus.html
- https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf
---
layout: model
title: Translate English to Azerbaijani Pipeline
author: John Snow Labs
name: translate_en_az
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, az, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `az`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_az_xx_2.7.0_2.4_1609685842780.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_az_xx_2.7.0_2.4_1609685842780.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_az", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_az", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.az').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_az|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Electronics And Electrical Engineering Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_electronics_and_electrical_engineering_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, electronics_and_electrical_engineering, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the `legclf_electronics_and_electrical_engineering_bert` model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the `Electronics_and_Electrical_Engineering` class or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`Electronics_and_Electrical_Engineering`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_electronics_and_electrical_engineering_bert_en_1.0.0_3.0_1678111679765.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_electronics_and_electrical_engineering_bert_en_1.0.0_3.0_1678111679765.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
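The usage snippet is missing from this card. The sketch below follows the pipeline pattern used by other Legal NLP document classifiers; the `sent_bert_base_cased` embeddings model and the output column name are assumptions here, so verify them against the official model card before relying on them.

```python
# Minimal sketch of the assumed classification pipeline.
# Requires spark-nlp plus the licensed Legal NLP library and an active Spark session.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence-level BERT embeddings feeding the classifier (model name is an assumption)
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_electronics_and_electrical_engineering_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
model = pipeline.fit(df)
result = model.transform(df)
```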
## Results
```bash
+----------------------------------------+
|result                                  |
+----------------------------------------+
|[Electronics_and_Electrical_Engineering]|
|[Other]                                 |
|[Other]                                 |
|[Electronics_and_Electrical_Engineering]|
+----------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_electronics_and_electrical_engineering_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.8 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Electronics_and_Electrical_Engineering 0.86 0.83 0.84 58
Other 0.85 0.88 0.86 64
accuracy - - 0.85 122
macro-avg 0.85 0.85 0.85 122
weighted-avg 0.85 0.85 0.85 122
```
---
layout: model
title: Legal Disclosure Of Information Clause Binary Classifier
author: John Snow Labs
name: legclf_disclosure_of_information_clause
date: 2023-01-27
tags: [en, legal, classification, disclosure, information, clauses, disclosure_of_information, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `disclosure-of-information` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, giving as output a series of True/False values for each of the legal clause models you have added.
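The paragraph-splitting step recommended above can be sketched in plain Python, independently of the Spark NLP annotators used in the tutorial (a minimal illustration, not the tutorial's own code):

```python
import re

def split_into_paragraphs(text: str):
    """Split a document into paragraphs on blank lines (split by multiline)."""
    # Two or more consecutive newlines mark a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    # Drop empty fragments and trim surrounding whitespace.
    return [p.strip() for p in paragraphs if p.strip()]

doc = "Clause 1. Disclosure of Information...\n\nClause 2. Termination...\n\nClause 3. Governing Law..."
for paragraph in split_into_paragraphs(doc):
    print(paragraph)
```

Each resulting paragraph can then be fed to the classifier as a separate row, keeping every input under the 512-token limit.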
## Predicted Entities
`disclosure-of-information`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_disclosure_of_information_clause_en_1.0.0_3.0_1674820995298.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_disclosure_of_information_clause_en_1.0.0_3.0_1674820995298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
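A minimal Python sketch of the usual clause-classification pipeline for this model, following the pattern of similar Legal NLP binary classifiers. The `sent_bert_base_cased` embeddings stage, the classifier class name, and the `category` output column are assumptions based on sibling model cards; a Spark session with the licensed Legal NLP library installed is required.

```python
# Assemble raw text into documents.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence embeddings feeding the classifier (assumed embeddings model).
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

# Binary clause classifier (licensed, served from legal/models).
doc_classifier = ClassifierDLModel.pretrained("legclf_disclosure_of_information_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```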
## Results
```bash
| | RxNormCode | Resolution | all_k_results | all_k_distances | all_k_cosine_distances | all_k_resolutions | all_k_aux_labels |
|---:|-------------:|:-------------------------------------------------------------------|:----------------------------------|:----------------------------------|:----------------------------------|:----------------------------------------------------------------|:----------------------------------|
| 0 | 855333 | warfarin sodium 5 MG [Coumadin] | 855333:::432467:::438740:::103... | 0.0000:::5.0617:::5.0617:::5.9... | 0.0000:::0.0388:::0.0388:::0.0... | warfarin sodium 5 MG [Coumadin]:::coumarin 5 MG Oral Tablet:... | Branded Drug Comp:::Clinical D... |
| 1 | 1537020 | aspirin Effervescent Oral Tablet | 1537020:::1191:::405403:::2187... | 0.0000:::0.0000:::9.0615:::9.4... | 0.0000:::0.0000:::0.1268:::0.1... | aspirin Effervescent Oral Tablet:::aspirin:::YSP Aspirin:::N... | Clinical Drug Form:::Ingredien... |
| 2 | 261242 | rosiglitazone 4 MG Oral Tablet [Avandia] | 261242:::208364:::1792373:::57... | 0.0000:::8.0227:::8.1631:::8.2... | 0.0000:::0.0982:::0.1001:::0.1... | rosiglitazone 4 MG Oral Tablet [Avandia]:::triamcinolone 4 M... | Branded Drug:::Branded Drug:::... |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_jsl_rxnorm_augmented|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[rxnorm_code]|
|Language:|en|
|Size:|970.8 MB|
|Case sensitive:|false|
---
layout: model
title: Chinese BertForTokenClassification Base Cased model (from ckiplab)
author: John Snow Labs
name: bert_token_classifier_base_han_chinese_ws
date: 2022-11-30
tags: [zh, open_source, bert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-han-chinese-ws` is a Chinese model originally trained by `ckiplab`.
## Predicted Entities
`B`, `I`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_han_chinese_ws_zh_4.2.4_3.0_1669814901320.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_han_chinese_ws_zh_4.2.4_3.0_1669814901320.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_han_chinese_ws","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_han_chinese_ws","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_base_han_chinese_ws|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|zh|
|Size:|395.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/ckiplab/bert-base-han-chinese-ws
- https://github.com/ckiplab/han-transformers
- http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/akiwi/kiwi.sh
- http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/dkiwi/kiwi.sh
- http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/pkiwi/kiwi.sh
- http://asbc.iis.sinica.edu.tw
- https://ckip.iis.sinica.edu.tw/
---
layout: model
title: English asr_model_sid_voxforge_cetuc_2 TFWav2Vec2ForCTC from joaoalvarenga
author: John Snow Labs
name: asr_model_sid_voxforge_cetuc_2
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_model_sid_voxforge_cetuc_2` is an English model originally trained by joaoalvarenga.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_model_sid_voxforge_cetuc_2_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_model_sid_voxforge_cetuc_2_en_4.2.0_3.0_1664022318789.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_model_sid_voxforge_cetuc_2_en_4.2.0_3.0_1664022318789.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_model_sid_voxforge_cetuc_2", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_model_sid_voxforge_cetuc_2", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_model_sid_voxforge_cetuc_2|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Detect Drug Information (Small)
author: John Snow Labs
name: ner_posology
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for posology. This NER model is trained with the ``embeddings_clinical`` word embeddings model, so be sure to use the same embeddings in the pipeline.
## Predicted Entities
``DOSAGE``, ``DRUG``, ``DURATION``, ``FORM``, ``FREQUENCY``, ``ROUTE``, ``STRENGTH``.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_en_3.0.0_2.4_1617208445872.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_en_3.0.0_2.4_1617208445872.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_posology","en","clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([['The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.']], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val model = MedicalNerModel.pretrained("ner_posology","en","clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, model, ner_converter))
val data = Seq("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.posology").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""")
```
## Results
```bash
+--------------+---------+
|chunk |ner |
+--------------+---------+
|insulin |DRUG |
|Bactrim |DRUG |
|for 14 days |DURATION |
|Fragmin |DRUG |
|5000 units |DOSAGE |
|subcutaneously|ROUTE |
|daily |FREQUENCY|
|Xenaderm |DRUG |
|topically |ROUTE |
|b.i.d., |FREQUENCY|
|Lantus |DRUG |
|40 units |DOSAGE |
|subcutaneously|ROUTE |
|at bedtime |FREQUENCY|
|OxyContin |DRUG |
|30 mg |STRENGTH |
|p.o |ROUTE |
|q.12 h |FREQUENCY|
|folic acid |DRUG |
|1 mg |STRENGTH |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_posology|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Data Source
Trained on the 2018 i2b2 dataset (no FDA) with ``embeddings_clinical``.
https://www.i2b2.org/NLP/Medication
## Benchmarking
```bash
| | label | tp | fp | fn | prec | rec | f1 |
|---:|--------------:|------:|------:|------:|---------:|---------:|---------:|
| 0 | B-DRUG | 1408 | 62 | 99 | 0.957823 | 0.934307 | 0.945919 |
| 1 | B-STRENGTH | 470 | 43 | 29 | 0.916179 | 0.941884 | 0.928854 |
| 2 | I-DURATION | 123 | 22 | 8 | 0.848276 | 0.938931 | 0.891304 |
| 3 | I-STRENGTH | 499 | 66 | 15 | 0.883186 | 0.970817 | 0.924931 |
| 4 | I-FREQUENCY | 945 | 47 | 55 | 0.952621 | 0.945 | 0.948795 |
| 5 | B-FORM | 365 | 13 | 12 | 0.965608 | 0.96817 | 0.966887 |
| 6 | B-DOSAGE | 298 | 27 | 26 | 0.916923 | 0.919753 | 0.918336 |
| 7 | I-DOSAGE | 348 | 29 | 22 | 0.923077 | 0.940541 | 0.931727 |
| 8 | I-DRUG | 208 | 25 | 60 | 0.892704 | 0.776119 | 0.830339 |
| 9 | I-ROUTE | 10 | 0 | 2 | 1 | 0.833333 | 0.909091 |
| 10 | B-ROUTE | 467 | 4 | 25 | 0.991507 | 0.949187 | 0.969886 |
| 11 | B-DURATION | 64 | 10 | 10 | 0.864865 | 0.864865 | 0.864865 |
| 12 | B-FREQUENCY | 588 | 12 | 17 | 0.98 | 0.971901 | 0.975934 |
| 13 | I-FORM | 264 | 5 | 4 | 0.981413 | 0.985075 | 0.98324 |
| 14 | Macro-average | 6057 | 365 | 384 | 0.93387 | 0.924277 | 0.929049 |
| 15 | Micro-average | 6057 | 365 | 384 | 0.943164 | 0.940382 | 0.941771 |
```
---
layout: model
title: Legal Performance Clause Binary Classifier
author: John Snow Labs
name: legclf_performance_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `performance` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, giving as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `performance`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_performance_clause_en_1.0.0_3.2_1660123818700.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_performance_clause_en_1.0.0_3.2_1660123818700.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
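A minimal Python sketch of the usual clause-classification pipeline for this model, following the pattern of similar Legal NLP binary classifiers. The `sent_bert_base_cased` embeddings stage and the classifier class name are assumptions based on sibling model cards (the `category` output column matches the Model Information table below); a Spark session with the licensed Legal NLP library installed is required.

```python
# Assemble raw text into documents.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence embeddings feeding the classifier (assumed embeddings model).
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

# Binary clause classifier (licensed, served from legal/models).
doc_classifier = ClassifierDLModel.pretrained("legclf_performance_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```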
## Results
```bash
+-------------+
|result       |
+-------------+
|[performance]|
|[other]      |
|[other]      |
|[performance]|
+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_performance_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scraped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.90 1.00 0.95 89
performance 1.00 0.74 0.85 39
accuracy - - 0.92 128
macro-avg 0.95 0.87 0.90 128
weighted-avg 0.93 0.92 0.92 128
```
---
layout: model
title: Translate English to Tsonga Pipeline
author: John Snow Labs
name: translate_en_ts
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, ts, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `ts`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ts_xx_2.7.0_2.4_1609699062996.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ts_xx_2.7.0_2.4_1609699062996.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_ts", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_ts", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.ts').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_ts|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Finnish BERT Embeddings (Base Uncased)
author: John Snow Labs
name: bert_finnish_uncased
date: 2020-08-31
task: Embeddings
language: fi
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, fi]
supported: true
deprecated: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A version of Google's BERT deep transfer learning model for Finnish. The model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks. `FinBERT` features a custom 50,000 wordpiece vocabulary that has much better coverage of Finnish words.
`FinBERT` has been pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls. By contrast, Multilingual BERT was trained on Wikipedia texts, where the Finnish Wikipedia text is approximately 3% of the amount used to train `FinBERT`.
These features allow `FinBERT` to outperform not only Multilingual BERT but also all previously proposed models when fine-tuned for Finnish natural language processing tasks.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_finnish_uncased_fi_2.6.0_2.4_1598897239983.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_finnish_uncased_fi_2.6.0_2.4_1598897239983.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("bert_finnish_uncased", "fi") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['Rakastan NLP: tä']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("bert_finnish_uncased", "fi")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("Rakastan NLP: tä").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["Rakastan NLP: tä"]
embeddings_df = nlu.load('fi.embed.bert.uncased.').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token fi_embed_bert_uncased__embeddings
Rakastan [-0.5126021504402161, -1.1741008758544922, 0.6...
NLP [1.4763829708099365, -1.5427947044372559, 0.80...
: [-0.2581554353237152, -0.5670831203460693, -1....
tä [0.39770740270614624, -0.7221324443817139, 0.1...
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_finnish_uncased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|fi|
|Dimension:|768|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from https://github.com/TurkuNLP/FinBERT
---
layout: model
title: English asr_sanskrit TFWav2Vec2ForCTC from Tarakki100
author: John Snow Labs
name: asr_sanskrit
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_sanskrit` is an English model originally trained by Tarakki100.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_sanskrit_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_sanskrit_en_4.2.0_3.0_1664112373546.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_sanskrit_en_4.2.0_3.0_1664112373546.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_sanskrit", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_sanskrit", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_sanskrit|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|227.9 MB|
---
layout: model
title: Pre-trained Pipeline for Few-NERD NER Model
author: John Snow Labs
name: nerdl_fewnerd_subentity_100d_pipeline
date: 2022-06-28
tags: [fewnerd, ner, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the Few-NERD/inter public dataset and extracts 66 entity types of general scope.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_FEW_NERD/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_FewNERD.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_subentity_100d_pipeline_en_4.0.0_3.0_1656388795031.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_subentity_100d_pipeline_en_4.0.0_3.0_1656388795031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
fewnerd_pipeline = PretrainedPipeline("nerdl_fewnerd_subentity_100d_pipeline", lang = "en")
fewnerd_pipeline.annotate("""12 Corazones ('12 Hearts') is Spanish-language dating game show produced in the United States for the television network Telemundo since January 2005, based on its namesake Argentine TV show format. The show is filmed in Los Angeles and revolves around the twelve Zodiac signs that identify each contestant. In 2008, Ho filmed a cameo in the Steven Spielberg feature film The Cloverfield Paradox, as a news pundit.""")
```
```scala
val pipeline = new PretrainedPipeline("nerdl_fewnerd_subentity_100d_pipeline", lang = "en")
val result = pipeline.fullAnnotate("12 Corazones ('12 Hearts') is Spanish-language dating game show produced in the United States for the television network Telemundo since January 2005, based on its namesake Argentine TV show format. The show is filmed in Los Angeles and revolves around the twelve Zodiac signs that identify each contestant. In 2008, Ho filmed a cameo in the Steven Spielberg feature film The Cloverfield Paradox, as a news pundit.")(0)
```
## Results
```bash
+-----------------------+----------------------------+
|chunk |ner_label |
+-----------------------+----------------------------+
|Corazones ('12 Hearts')|art-broadcastprogram |
|Spanish-language |other-language |
|United States |location-GPE |
|Telemundo |organization-media/newspaper|
|Argentine TV |organization-media/newspaper|
|Los Angeles |location-GPE |
|Steven Spielberg |person-director |
|Cloverfield Paradox |art-film |
+-----------------------+----------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|nerdl_fewnerd_subentity_100d_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|167.8 MB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- NerDLModel
- NerConverter
- Finisher
---
layout: model
title: English RoBERTa Embeddings (Base, Biomarkers/Carcinoma/Clinical Trial)
author: John Snow Labs
name: roberta_embeddings_roberta_pubmed
date: 2022-04-14
tags: [roberta, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-pubmed` is an English model originally trained by `raynardj`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_pubmed_en_3.4.2_3.0_1649946815266.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_pubmed_en_3.4.2_3.0_1649946815266.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_pubmed","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_pubmed","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.roberta_pubmed").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_roberta_pubmed|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|468.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/raynardj/roberta-pubmed
- https://pubmed.ncbi.nlm.nih.gov/
- https://www.ncbi.nlm.nih.gov/mesh/
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from cosmo)
author: John Snow Labs
name: distilbert_qa_cosmo_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `cosmo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_cosmo_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770516755.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_cosmo_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770516755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cosmo_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cosmo_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_cosmo_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/cosmo/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Fast Neural Machine Translation Model from English to Catalan
author: John Snow Labs
name: opus_mt_en_ca
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, ca, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `ca`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ca_xx_2.7.0_2.4_1609167744249.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ca_xx_2.7.0_2.4_1609167744249.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_ca", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["Your sentence to translate!"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_ca", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ca').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ca|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from skandaonsolve)
author: John Snow Labs
name: roberta_qa_finetuned_location
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-finetuned-location` is an English model originally trained by `skandaonsolve`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_location_en_4.3.0_3.0_1674220382399.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_location_en_4.3.0_3.0_1674220382399.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_location","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_location","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_finetuned_location|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|464.7 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/skandaonsolve/roberta-finetuned-location
---
layout: model
title: Finnish Legal Roberta Embeddings
author: John Snow Labs
name: roberta_base_finnish_legal
date: 2023-02-16
tags: [fi, finnish, embeddings, transformer, open_source, legal, tensorflow]
task: Embeddings
language: fi
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-finnish-roberta-base` is a Finnish model originally trained by `joelito`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_finnish_legal_fi_4.2.4_3.0_1676561071432.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_finnish_legal_fi_4.2.4_3.0_1676561071432.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_base_finnish_legal|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fi|
|Size:|416.0 MB|
|Case sensitive:|true|
## References
https://huggingface.co/joelito/legal-finnish-roberta-base
---
layout: model
title: Stopwords Remover for Dutch language (352 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, nl, open_source]
task: Stop Words Removal
language: nl
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_nl_3.4.1_3.0_1646673228420.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_nl_3.4.1_3.0_1646673228420.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","nl") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Je bent niet beter dan ik"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","nl")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("Je bent niet beter dan ik").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("nl.stopwords").predict("""Je bent niet beter dan ik""")
```
## Results
```bash
+------+
|result|
+------+
|[] |
+------+
```
Every token in the example sentence is a Dutch stopword, which is why the cleaned result is empty.
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|nl|
|Size:|2.4 KB|
---
layout: model
title: Finance Pipeline (Headers / Subheaders)
author: John Snow Labs
name: finpipe_header_subheader
date: 2023-01-20
tags: [en, finance, ner, licensed, contextual_parser]
task: Named Entity Recognition
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a finance pretrained pipeline that helps you split long financial documents into smaller sections by detecting the Headers and Subheaders of those sections. You can then use the begin and end offsets in the metadata to retrieve the text between those headers.
- `PART I`, `PART II`, etc. are HEADER entities
- `Item 1`, `Item 2`, etc. are also HEADER entities
- `Item 1A`, `2B`, etc. are SUBHEADER entities
- `1.`, `2.`, `2.1`, etc. are SUBHEADER entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finpipe_header_subheader_en_1.0.0_3.0_1674243435691.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finpipe_header_subheader_en_1.0.0_3.0_1674243435691.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
finance_pipeline = nlp.PretrainedPipeline("finpipe_header_subheader", "en", "finance/models")
text = ["""
Item 2. Definitions.
For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1. 2. Appointment as Reseller.
Item 2A. Appointment.
The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6
Item 2B. Customer Agreements."""]
result = finance_pipeline.annotate(text)
```
## Results
```bash
| chunks | begin | end | entities |
|------------------------------:|------:|----:|----------:|
| Item 2. Definitions. | 1 | 21 | HEADER |
| Item 2A. Appointment. | 158 | 179 | SUBHEADER |
| Item 2B. Customer Agreements. | 538 | 566 | SUBHEADER |
```
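Using the begin/end offsets above, the document can be sliced into (header, body) sections; a minimal post-processing sketch (the helper name and input shape are assumptions, not part of the pipeline):

```python
def split_by_headers(text, header_spans):
    """Slice text into (header, body) pairs given sorted,
    inclusive (begin, end) offsets of each header chunk."""
    sections = []
    for i, (begin, end) in enumerate(header_spans):
        # body runs from just past this header to the next header (or EOF)
        next_begin = header_spans[i + 1][0] if i + 1 < len(header_spans) else len(text)
        header = text[begin:end + 1]
        body = text[end + 1:next_begin].strip()
        sections.append((header, body))
    return sections
```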
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finpipe_header_subheader|
|Type:|pipeline|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|23.6 KB|
## Included Models
- DocumentAssembler
- TokenizerModel
- ContextualParserModel
- ContextualParserModel
- ChunkMergeModel
---
layout: model
title: English asr_wav2vec2_base_timit_demo_by_patrickvonplaten TFWav2Vec2ForCTC from patrickvonplaten
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_by_patrickvonplaten
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_by_patrickvonplaten` is an English model originally trained by patrickvonplaten.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_by_patrickvonplaten_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_by_patrickvonplaten_en_4.2.0_3.0_1664025434108.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_by_patrickvonplaten_en_4.2.0_3.0_1664025434108.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_by_patrickvonplaten', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_by_patrickvonplaten", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_by_patrickvonplaten|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|349.4 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English Part of Speech Tagger (from KoichiYasuoka)
author: John Snow Labs
name: xlmroberta_pos_xlm_roberta_base_english_upos
date: 2022-05-18
tags: [xlm_roberta, pos, part_of_speech, en, open_source]
task: Part of Speech Tagging
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Part of Speech model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-english-upos` is an English model originally trained by `KoichiYasuoka`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_pos_xlm_roberta_base_english_upos_en_3.4.2_3.0_1652837577026.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_pos_xlm_roberta_base_english_upos_en_3.4.2_3.0_1652837577026.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_pos_xlm_roberta_base_english_upos","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_pos_xlm_roberta_base_english_upos","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_pos_xlm_roberta_base_english_upos|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|791.1 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/KoichiYasuoka/xlm-roberta-base-english-upos
- https://github.com/UniversalDependencies/UD_English-EWT
- https://universaldependencies.org/u/pos/
- https://github.com/KoichiYasuoka/esupar
---
layout: model
title: Translate Kinyarwanda to English Pipeline
author: John Snow Labs
name: translate_rw_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, rw, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this module is very computationally expensive, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `rw`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_rw_en_xx_2.7.0_2.4_1609687466102.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_rw_en_xx_2.7.0_2.4_1609687466102.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_rw_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_rw_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.rw.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_rw_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Spanish RobertaForQuestionAnswering Large Cased model (from stevemobs)
author: John Snow Labs
name: roberta_qa_large_fine_tuned_squad
date: 2023-01-20
tags: [es, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: es
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-fine-tuned-squad-es` is a Spanish model originally trained by `stevemobs`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_fine_tuned_squad_es_4.3.0_3.0_1674221753097.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_fine_tuned_squad_es_4.3.0_3.0_1674221753097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_fine_tuned_squad","es")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_fine_tuned_squad","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_large_fine_tuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|es|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/stevemobs/roberta-large-fine-tuned-squad-es
---
layout: model
title: English asr_wav2vec2_xls_r_300m_cv8 TFWav2Vec2ForCTC from comodoro
author: John Snow Labs
name: asr_wav2vec2_xls_r_300m_cv8
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_cv8` is an English model originally trained by comodoro.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_300m_cv8_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_cv8_en_4.2.0_3.0_1664036662517.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_cv8_en_4.2.0_3.0_1664036662517.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_xls_r_300m_cv8", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_xls_r_300m_cv8", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_xls_r_300m_cv8|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Fake News Classifier
author: John Snow Labs
name: classifierdl_use_fakenews
date: 2021-01-09
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 2.7.1
spark_version: 2.4
tags: [open_source, en, classifier]
supported: true
annotator: ClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Determine if news articles are Real or Fake.
## Predicted Entities
`REAL`, `FAKE`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_EN_FAKENEWS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_EN_FAKENEWS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_fakenews_en_2.7.1_2.4_1610187399147.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_fakenews_en_2.7.1_2.4_1610187399147.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
use = UniversalSentenceEncoder.pretrained('tfhub_use', lang="en") \
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
document_classifier = ClassifierDLModel.pretrained('classifierdl_use_fakenews', 'en') \
.setInputCols(["document", "sentence_embeddings"]) \
.setOutputCol("class")
nlpPipeline = Pipeline(stages=[document_assembler, use, document_classifier])
light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate('Donald Trump a KGB Spy? 11/02/2016 In today’s video, Christopher Greene of AMTV reports Hillary Clinton')
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val use = UniversalSentenceEncoder.pretrained(lang="en")
.setInputCols(Array("document"))
.setOutputCol("sentence_embeddings")
val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_fakenews", "en")
.setInputCols(Array("document", "sentence_embeddings"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier))
val data = Seq("Donald Trump a KGB Spy? 11/02/2016 In today’s video, Christopher Greene of AMTV reports Hillary Clinton").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Donald Trump a KGB Spy? 11/02/2016 In today’s video, Christopher Greene of AMTV reports Hillary Clinton"""]
fake_df = nlu.load('classify.fakenews.use').predict(text, output_level='document')
fake_df[["document", "fakenews"]]
```
## Results
```bash
+--------------------------------------------------------------------------------------------------------+------------+
|document |class |
+--------------------------------------------------------------------------------------------------------+------------+
|Donald Trump a KGB Spy? 11/02/2016 In today’s video, Christopher Greene of AMTV reports Hillary Clinton | FAKE |
+--------------------------------------------------------------------------------------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|classifierdl_use_fakenews|
|Compatibility:|Spark NLP 2.7.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Dependencies:|tfhub_use|
## Data Source
This model is trained on the fake news classification challenge: https://raw.githubusercontent.com/joolsa/fake_real_news_dataset/master/fake_or_real_news.csv.zip
## Benchmarking
```bash
precision recall f1-score support
FAKE 0.86 0.89 0.88 626
REAL 0.89 0.86 0.87 634
accuracy 0.87 1260
macro avg 0.88 0.87 0.87 1260
weighted avg 0.88 0.87 0.87 1260
```
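As a quick sanity check, the macro and weighted averages in the table above follow directly from the per-class rows. The snippet below is plain illustrative arithmetic, not Spark NLP code:

```python
# Per-class F1 scores and supports taken from the benchmarking table above
f1_fake, f1_real = 0.88, 0.87
support_fake, support_real = 626, 634

# Macro average: unweighted mean of the per-class scores
macro_f1 = (f1_fake + f1_real) / 2

# Weighted average: support-weighted mean of the per-class scores
weighted_f1 = (f1_fake * support_fake + f1_real * support_real) / (support_fake + support_real)

print(round(macro_f1, 3), round(weighted_f1, 3))  # both land near the reported 0.87
```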
---
layout: model
title: Fast Neural Machine Translation Model from Chuukese to English
author: John Snow Labs
name: opus_mt_chk_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, chk, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `chk`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_chk_en_xx_2.7.0_2.4_1609169397835.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_chk_en_xx_2.7.0_2.4_1609169397835.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_chk_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_chk_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.chk.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_chk_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Finnish asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman
date: 2022-09-24
tags: [wav2vec2, fi, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman` is a Finnish model originally trained by jonatasgrosman.
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use `pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman_fi_4.2.0_3.0_1664041894280.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman_fi_4.2.0_3.0_1664041894280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman', lang = 'fi')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman", lang = "fi")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|fi|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English BertForQuestionAnswering model (from deepset)
author: John Snow Labs
name: bert_qa_minilm_uncased_squad2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `minilm-uncased-squad2` is an English model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_minilm_uncased_squad2_en_4.0.0_3.0_1654188279232.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_minilm_uncased_squad2_en_4.0.0_3.0_1654188279232.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_minilm_uncased_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_minilm_uncased_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.bert.mini_lm_base_uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_minilm_uncased_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|123.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/deepset/minilm-uncased-squad2
- https://github.com/deepset-ai/haystack/discussions
- https://deepset.ai
- https://github.com/deepset-ai/FARM/blob/master/examples/question_answering.py
- https://twitter.com/deepset_ai
- http://www.deepset.ai/jobs
- https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
- https://haystack.deepset.ai/community/join
- https://github.com/deepset-ai/haystack/
- https://deepset.ai/german-bert
- https://www.linkedin.com/company/deepset-ai/
- https://github.com/deepset-ai/FARM
- https://deepset.ai/germanquad
---
layout: model
title: Abkhazian asr_xls_r_ab_test_by_muneson TFWav2Vec2ForCTC from muneson
author: John Snow Labs
name: pipeline_asr_xls_r_ab_test_by_muneson
date: 2022-09-24
tags: [wav2vec2, ab, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: ab
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_ab_test_by_muneson` is an Abkhazian model originally trained by muneson.
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use `pipeline_asr_xls_r_ab_test_by_muneson_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_ab_test_by_muneson_ab_4.2.0_3.0_1664019208833.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_ab_test_by_muneson_ab_4.2.0_3.0_1664019208833.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_xls_r_ab_test_by_muneson', lang = 'ab')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_xls_r_ab_test_by_muneson", lang = "ab")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_xls_r_ab_test_by_muneson|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|ab|
|Size:|452.2 KB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from sasuke)
author: John Snow Labs
name: distilbert_qa_sasuke_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `sasuke`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sasuke_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772441119.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sasuke_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772441119.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sasuke_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sasuke_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_sasuke_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/sasuke/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Financial Deidentification Pipeline
author: John Snow Labs
name: finpipe_deid
date: 2023-02-27
tags: [deid, deidentification, anonymization, en, licensed]
task: [De-identification, Pipeline Finance]
language: en
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
recommended: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a Pretrained Pipeline aimed at deidentifying legal and financial documents to be compliant with data privacy regulations such as the GDPR and CCPA. Since the models used in this pipeline are statistical, make sure you use it in a human-in-the-loop process to guarantee 100% accuracy.
You can carry out both masking and obfuscation with this pipeline, on the following entities:
`ALIAS`, `EMAIL`, `PHONE`, `PROFESSION`, `ORG`, `DATE`, `PERSON`, `ADDRESS`, `STREET`, `CITY`, `STATE`, `ZIP`, `COUNTRY`, `TITLE_CLASS`, `TICKER`, `STOCK_EXCHANGE`, `CFN`, `IRS`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/DEID_FIN/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finpipe_deid_en_1.0.0_3.0_1677508149273.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finpipe_deid_en_1.0.0_3.0_1677508149273.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("finpipe_deid", "en", "finance/models")
sample = """CARGILL, INCORPORATED
By: Pirkko Suominen
Name: Pirkko Suominen Title: Director, Bio Technology Development, Date: 10/19/2011
BIOAMBER, SAS
By: Jean-François Huc
Name: Jean-François Huc Title: President Date: October 15, 2011
email : jeanfran@gmail.com
phone : 1808733909
"""
result = deid_pipeline.annotate(sample)
print("\nMasked with entity labels")
print("-"*30)
print("\n".join(result['deidentified']))
print("\nMasked with chars")
print("-"*30)
print("\n".join(result['masked_with_chars']))
print("\nMasked with fixed length chars")
print("-"*30)
print("\n".join(result['masked_fixed_length_chars']))
print("\nObfuscated")
print("-"*30)
print("\n".join(result['obfuscated']))
```
## Results
```bash
Masked with entity labels
------------------------------
<ORG>, <ORG>
By: <PERSON>
Name: <PERSON>: <PROFESSION>, Date: <DATE>
<ORG>, <ORG>
By: <PERSON>
Name: <PERSON>: <PROFESSION>Date: <DATE>
email : <EMAIL>
phone : <PHONE>
Masked with chars
------------------------------
[*****], [**********]
By: [*************]
Name: [*******************]: [**********************************] Center, Date: [********]
[******], [*]
By: [***************]
Name: [**********************]: [*******]Date: [**************]
email : [****************]
phone : [********]
Masked with fixed length chars
------------------------------
****, ****
By: ****
Name: ****: ****, Date: ****
****, ****
By: ****
Name: ****: ****Date: ****
email : ****
phone : ****
Obfuscated
------------------------------
MGT Trust Company, LLC., Clarus llc.
By: Benjamin Dean
Name: John Snow Labs Inc: Sales Manager, Date: 03/08/2025
Clarus llc., SESA CO.
By: JAMES TURNER
Name: MGT Trust Company, LLC.: Business ManagerDate: 11/7/2016
email : Tyrus@google.com
phone : 78 834 854
```
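The masking policies shown above can be sketched with plain Python string operations. This is an illustrative toy only (the function names and the chunk list are hypothetical), not the logic of the licensed DeIdentificationModel; obfuscation, which substitutes realistic surrogate values, is omitted:

```python
def mask_with_labels(text, chunks):
    # Replace each detected chunk with its entity label, e.g. <EMAIL>
    for chunk, label in chunks:
        text = text.replace(chunk, f"<{label}>")
    return text

def mask_with_chars(text, chunks):
    # Replace each chunk with a same-length [***...] placeholder
    for chunk, _ in chunks:
        text = text.replace(chunk, "[" + "*" * (len(chunk) - 2) + "]")
    return text

def mask_fixed_length(text, chunks):
    # Replace each chunk with a fixed-length **** placeholder
    for chunk, _ in chunks:
        text = text.replace(chunk, "****")
    return text

chunks = [("jeanfran@gmail.com", "EMAIL")]
print(mask_with_labels("email : jeanfran@gmail.com", chunks))   # email : <EMAIL>
print(mask_with_chars("email : jeanfran@gmail.com", chunks))    # email : [****************]
print(mask_fixed_length("email : jeanfran@gmail.com", chunks))  # email : ****
```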
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finpipe_deid|
|Type:|pipeline|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|458.6 MB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- BertEmbeddings
- FinanceNerModel
- NerConverterInternalModel
- FinanceNerModel
- NerConverterInternalModel
- FinanceNerModel
- NerConverterInternalModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ChunkMergeModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
---
layout: model
title: Detect Clinical Conditions (ner_eu_clinical_condition - it)
author: John Snow Labs
name: ner_eu_clinical_condition
date: 2023-02-06
tags: [it, clinical, licensed, ner, clinical_condition]
task: Named Entity Recognition
language: it
edition: Healthcare NLP 4.2.8
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition (NER) deep learning model for extracting clinical conditions from Italian texts. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN.
The corpus used for model training is provided by European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
## Predicted Entities
`clinical_condition`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_it_4.2.8_3.0_1675726754516.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_it_4.2.8_3.0_1675726754516.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","it")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained('ner_eu_clinical_condition', "it", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["""Donna, 64 anni, ricovero per dolore epigastrico persistente, irradiato a barra e posteriormente, associato a dispesia e anoressia. Poche settimane dopo compaiono, però, iperemia, intenso edema vulvare ed una esione ulcerativa sul lato sinistro della parete rettale che la RM mostra essere una fistola transfinterica. Questi trattamenti determinano un miglioramento dell’ infiammazione ed una riduzione dell’ ulcera, ma i condilomi permangono inalterati."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","it")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_eu_clinical_condition", "it", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter))
val data = Seq("""Donna, 64 anni, ricovero per dolore epigastrico persistente, irradiato a barra e posteriormente, associato a dispesia e anoressia. Poche settimane dopo compaiono, però, iperemia, intenso edema vulvare ed una esione ulcerativa sul lato sinistro della parete rettale che la RM mostra essere una fistola transfinterica. Questi trattamenti determinano un miglioramento dell’ infiammazione ed una riduzione dell’ ulcera, ma i condilomi permangono inalterati.""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+----------------------+------------------+
|chunk |ner_label |
+----------------------+------------------+
|dolore epigastrico |clinical_condition|
|anoressia |clinical_condition|
|iperemia |clinical_condition|
|edema |clinical_condition|
|fistola transfinterica|clinical_condition|
|infiammazione |clinical_condition|
+----------------------+------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_eu_clinical_condition|
|Compatibility:|Healthcare NLP 4.2.8+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|it|
|Size:|903.5 KB|
## References
The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
## Benchmarking
```bash
label tp fp fn total precision recall f1
clinical_condition 208.0 35.0 46.0 254.0 0.8560 0.8189 0.8370
macro - - - - - - 0.8370
micro - - - - - - 0.8370
```
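The precision, recall, and F1 figures above follow directly from the tp/fp/fn counts. The snippet below is plain illustrative arithmetic, not Spark NLP code:

```python
# Counts taken from the benchmarking table above
tp, fp, fn = 208.0, 35.0, 46.0

precision = tp / (tp + fp)  # 208 / 243
recall = tp / (tp + fn)     # 208 / 254
f1 = 2 * precision * recall / (precision + recall)

print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # 0.8560 0.8189 0.8370
```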
---
layout: model
title: English RobertaForQuestionAnswering (from saburbutt)
author: John Snow Labs
name: roberta_qa_roberta_large_tweetqa
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta_large_tweetqa` is an English model originally trained by `saburbutt`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_tweetqa_en_4.0.0_3.0_1655739122606.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_tweetqa_en_4.0.0_3.0_1655739122606.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_tweetqa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_large_tweetqa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.trivia.roberta.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_large_tweetqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/saburbutt/roberta_large_tweetqa
---
layout: model
title: English image_classifier_vit_iiif_manuscript_ ViTForImageClassification from davanstrien
author: John Snow Labs
name: image_classifier_vit_iiif_manuscript_
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_iiif_manuscript_` is an English model originally trained by davanstrien.
## Predicted Entities
`3rd upper flyleaf verso`, `Blank leaf recto`, `3rd lower flyleaf verso`, `2nd lower flyleaf verso`, `2nd upper flyleaf verso`, `flyleaf`, `1st upper flyleaf verso`, `1st lower flyleaf verso`, `fol`, `cover`, `Lower flyleaf verso`, `Blank leaf verso`, `Upper flyleaf verso`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_iiif_manuscript__en_4.1.0_3.0_1660170321999.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_iiif_manuscript__en_4.1.0_3.0_1660170321999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_iiif_manuscript_", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_iiif_manuscript_", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_iiif_manuscript_|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Legal Purchase and sale Clause Binary Classifier
author: John Snow Labs
name: legclf_purchase_and_sale_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `purchase-and-sale` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
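The first technique above, paragraph splitting by multiline, can be sketched in plain Python (a simplified stand-in for the Spark NLP splitting annotators covered in the tutorial, not the library's own implementation):

```python
import re

def split_paragraphs(text: str):
    # "Paragraph splitting (by multiline)": break on one or more blank lines
    # and drop empty fragments, so each clause candidate is classified alone.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "First clause text.\n\nSecond clause text.\n\n\nThird clause text."
paragraphs = split_paragraphs(doc)
```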
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `purchase-and-sale`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_purchase_and_sale_clause_en_1.0.0_3.2_1660123863834.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_purchase_and_sale_clause_en_1.0.0_3.2_1660123863834.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------------------+
|             result|
+-------------------+
|[purchase-and-sale]|
|            [other]|
|            [other]|
|[purchase-and-sale]|
+-------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_purchase_and_sale_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.93 1.00 0.96 65
purchase-and-sale 1.00 0.88 0.94 43
accuracy - - 0.95 108
macro-avg 0.96 0.94 0.95 108
weighted-avg 0.96 0.95 0.95 108
```
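As a sanity check, the macro and weighted averages in the table follow directly from the per-label scores (macro averages the labels equally; weighted averages them by support):

```python
# Per-label (precision, recall, f1, support) rows from the table above.
scores = {
    "other":             (0.93, 1.00, 0.96, 65),
    "purchase-and-sale": (1.00, 0.88, 0.94, 43),
}
total = sum(sup for *_, sup in scores.values())
macro_f1 = sum(f1 for *_, f1, _ in scores.values()) / len(scores)
weighted_f1 = sum(f1 * sup for *_, f1, sup in scores.values()) / total
```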
---
layout: model
title: Detect Problems, Tests and Treatments (ner_clinical) in German
author: John Snow Labs
name: ner_clinical
date: 2023-05-05
tags: [ner, clinical, licensed, de]
task: Named Entity Recognition
language: de
edition: Healthcare NLP 4.4.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for clinical terms in German. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
## Predicted Entities
`PROBLEM`, `TEST`, `TREATMENT`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/14.German_Healthcare_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_de_4.4.0_3.0_1683310968546.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_de_4.4.0_3.0_1683310968546.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "de", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
sample_text= """Verschlechterung von Schmerzen oder Schwäche in den Beinen , Verlust der Darm - oder Blasenfunktion oder andere besorgniserregende Symptome.
Der Patient erhielt empirisch Ampicillin , Gentamycin und Flagyl sowie Narcan zur Umkehrung von Fentanyl .
ALT war 181 , AST war 156 , LDH war 336 , alkalische Phosphatase war 214 und Bilirubin war insgesamt 12,7 ."""
results = model.transform(spark.createDataFrame([[sample_text]], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_clinical", "de", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))
val data = Seq("""Verschlechterung von Schmerzen oder Schwäche in den Beinen , Verlust der Darm - oder Blasenfunktion oder andere besorgniserregende Symptome.
Der Patient erhielt empirisch Ampicillin , Gentamycin und Flagyl sowie Narcan zur Umkehrung von Fentanyl .
ALT war 181 , AST war 156 , LDH war 336 , alkalische Phosphatase war 214 und Bilirubin war insgesamt 12,7 .""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_javanese_bert_small_imdb","jv") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_javanese_bert_small_imdb","jv")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("jv.embed.javanese_bert_small_imdb").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_javanese_bert_small_imdb|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|jv|
|Size:|410.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/w11wo/javanese-bert-small-imdb
- https://arxiv.org/abs/1810.04805
- https://github.com/sgugger
- https://w11wo.github.io/
---
layout: model
title: Legal Bankruptcy Clause Binary Classifier
author: John Snow Labs
name: legclf_bankruptcy_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `bankruptcy` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `bankruptcy`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_bankruptcy_clause_en_1.0.0_3.2_1660122157662.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_bankruptcy_clause_en_1.0.0_3.2_1660122157662.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+------------+
|      result|
+------------+
|[bankruptcy]|
|     [other]|
|     [other]|
|[bankruptcy]|
+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_bankruptcy_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
bankruptcy 1.00 0.73 0.84 26
other 0.94 1.00 0.97 107
accuracy - - 0.95 133
macro-avg 0.97 0.87 0.91 133
weighted-avg 0.95 0.95 0.94 133
```
---
layout: model
title: Detect PHI for Deidentification purposes (Spanish, Roberta, augmented)
author: John Snow Labs
name: ner_deid_subentity_roberta_augmented
date: 2022-02-16
tags: [deid, es, licensed,clinical]
task: De-identification
language: es
edition: Healthcare NLP 3.3.4
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Named Entity Recognition annotators allow a generic model to be trained using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 17 entities, more than the previously released `ner_deid_subentity_roberta` model.
This NER model is trained on a combination of custom datasets, the Spanish CoNLL 2002, MeddoProf and MeddoCan datasets, and includes several data augmentation mechanisms.
This version uses RoBERTa clinical embeddings. A variant, `ner_deid_subentity_augmented`, uses Sciwi 300d embeddings instead.
## Predicted Entities
`PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `ID`, `STREET`, `USERNAME`, `SEX`, `EMAIL`, `ZIP`, `MEDICALRECORD`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_roberta_augmented_es_3.3.4_3.0_1645006804071.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_roberta_augmented_es_3.3.4_3.0_1645006804071.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = medical.NerModel.pretrained("ner_deid_subentity_roberta_augmented", "es", "clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
roberta_embeddings,
clinical_ner])
text = ['''
Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.
''']
df = spark.createDataFrame([text]).toDF("text")
results = nlpPipeline.fit(df).transform(df)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_roberta_augmented", "es", "clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
roberta_embeddings,
clinical_ner))
val text = "Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos."
val df = Seq(text).toDF("text")
val results = pipeline.fit(df).transform(df)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.med_ner.deid.subentity.roberta").predict("""
Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.
""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_small_grammar_v2","ro") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_small_grammar_v2","ro")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_small_grammar_v2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|ro|
|Size:|288.1 MB|
## References
- https://huggingface.co/BlackKakapo/t5-small-grammar-ro-v2
- https://img.shields.io/badge/V.2-06.08.2022-brightgreen
---
layout: model
title: Translate Lushai to English Pipeline
author: John Snow Labs
name: translate_lus_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, lus, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `lus`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_lus_en_xx_2.7.0_2.4_1609686287332.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_lus_en_xx_2.7.0_2.4_1609686287332.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_lus_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_lus_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.lus.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_lus_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English AlbertForQuestionAnswering model (from sultan)
author: John Snow Labs
name: albert_qa_BioM_xxlarge_SQuAD2
date: 2022-06-24
tags: [en, open_source, albert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: AlBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BioM-ALBERT-xxlarge-SQuAD2` is an English model originally trained by `sultan`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_BioM_xxlarge_SQuAD2_en_4.0.0_3.0_1656063644904.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_BioM_xxlarge_SQuAD2_en_4.0.0_3.0_1656063644904.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_BioM_xxlarge_SQuAD2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_BioM_xxlarge_SQuAD2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.albert.xxl.by_sultan").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_qa_BioM_xxlarge_SQuAD2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|771.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/sultan/BioM-ALBERT-xxlarge-SQuAD2
- http://participants-area.bioasq.org/results/9b/phaseB/
- https://github.com/salrowili/BioM-Transformers
---
layout: model
title: Italian BertForMaskedLM Base Uncased model (from dbmdz)
author: John Snow Labs
name: bert_embeddings_base_italian_xxl_uncased
date: 2022-12-02
tags: [it, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: it
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-italian-xxl-uncased` is an Italian model originally trained by `dbmdz`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_italian_xxl_uncased_it_4.2.4_3.0_1670018034736.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_italian_xxl_uncased_it_4.2.4_3.0_1670018034736.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_italian_xxl_uncased","it") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_italian_xxl_uncased","it")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_italian_xxl_uncased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|it|
|Size:|415.3 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased
- http://opus.nlpl.eu/
- https://traces1.inria.fr/oscar/
- https://github.com/dbmdz/berts/issues/7
- https://github.com/stefan-it/turkish-bert/tree/master/electra
- https://github.com/stefan-it/italian-bertelectra
- https://github.com/dbmdz/berts/issues/new
---
layout: model
title: Legal Registration Clause Binary Classifier
author: John Snow Labs
name: legclf_registration_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `registration` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `registration`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_registration_clause_en_1.0.0_3.2_1660123909144.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_registration_clause_en_1.0.0_3.2_1660123909144.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+--------------+
|        result|
+--------------+
|[registration]|
|       [other]|
|       [other]|
|[registration]|
+--------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_registration_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.90 0.97 0.93 65
registration-rights 0.94 0.83 0.88 41
accuracy - - 0.92 106
macro-avg 0.92 0.90 0.91 106
weighted-avg 0.92 0.92 0.91 106
```
---
layout: model
title: Pipeline to Detect Chemical Compounds and Genes
author: John Snow Labs
name: ner_chemprot_clinical_pipeline
date: 2023-03-15
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_chemprot_clinical](https://nlp.johnsnowlabs.com/2021/03/31/ner_chemprot_clinical_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_clinical_pipeline_en_4.3.0_3.2_1678865440862.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_clinical_pipeline_en_4.3.0_3.2_1678865440862.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_chemprot_clinical_pipeline", "en", "clinical/models")
text = '''Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_chemprot_clinical_pipeline", "en", "clinical/models")
val text = "Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.chemprot_clinical.pipeline").predict("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:-------------|--------:|------:|:------------|-------------:|
| 0 | Keratinocyte | 0 | 11 | GENE-Y | 0.7433 |
| 1 | growth | 13 | 18 | GENE-Y | 0.6481 |
| 2 | factor | 20 | 25 | GENE-Y | 0.5693 |
| 3 | acidic | 31 | 36 | GENE-Y | 0.5518 |
| 4 | fibroblast | 38 | 47 | GENE-Y | 0.5111 |
| 5 | growth | 49 | 54 | GENE-Y | 0.4559 |
| 6 | factor | 56 | 61 | GENE-Y | 0.5213 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_chemprot_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: English RobertaForQuestionAnswering Large Cased model (from deepset)
author: John Snow Labs
name: roberta_qa_large_squad2_hp
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-squad2-hp` is an English model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_squad2_hp_en_4.3.0_3.0_1674222219863.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_squad2_hp_en_4.3.0_3.0_1674222219863.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_squad2_hp","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_squad2_hp","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_large_squad2_hp|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/deepset/roberta-large-squad2-hp
---
layout: model
title: Summarize Clinical Question Notes
author: John Snow Labs
name: summarizer_clinical_questions
date: 2023-04-03
tags: [licensed, en, clinical, summarization, tensorflow]
task: Summarization
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalSummarizer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a modified version of a Flan-T5-based (LLM) summarization model, fine-tuned by John Snow Labs with medical questions exchanged in clinical mediums (clinic, email, call center, etc.). It can generate summaries of up to 512 tokens given an input text (max 1024 tokens).
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/MEDICAL_TEXT_SUMMARIZATION_QA/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/32.Medical_Text_Summarization.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_questions_en_4.3.2_3.0_1680550227628.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_questions_en_4.3.2_3.0_1680550227628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
summarizer = MedicalSummarizer.pretrained("summarizer_clinical_questions", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("summary")\
.setMaxTextLength(512)\
.setMaxNewTokens(512)
pipeline = sparknlp.base.Pipeline(stages=[
document_assembler,
summarizer
])
text = """
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
"""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val summarizer = MedicalSummarizer.pretrained("summarizer_clinical_questions", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("summary")
.setMaxTextLength(512)
.setMaxNewTokens(512)
val pipeline = new Pipeline().setStages(Array(document_assembler, summarizer))
val text = """Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
"""
val data = Seq(text).toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
['What are the treatments for hyperthyroidism?']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|summarizer_clinical_questions|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|920.0 MB|
---
layout: model
title: Word2Vec Embeddings in Dimli (individual language) (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, diq, open_source]
task: Embeddings
language: diq
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_diq_3.4.1_3.0_1647467748159.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_diq_3.4.1_3.0_1647467748159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","diq") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","diq")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("diq.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|diq|
|Size:|101.9 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Sundanese asr_wav2vec2_large_xlsr_sundanese TFWav2Vec2ForCTC from cahya
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_sundanese
date: 2022-09-24
tags: [wav2vec2, su, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: su
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_sundanese` is a Sundanese model originally trained by cahya.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_sundanese_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_sundanese_su_4.2.0_3.0_1664039175545.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_sundanese_su_4.2.0_3.0_1664039175545.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_sundanese', lang = 'su')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_sundanese", lang = "su")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_sundanese|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|su|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Professions & Occupations NER model in Spanish (meddroprof_scielowiki)
author: John Snow Labs
name: meddroprof_scielowiki
date: 2021-07-26
tags: [ner, licensed, professions, es, occupations]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 3.1.3
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
NER model that detects professions and occupations in Spanish texts. It was trained with the `embeddings_scielowiki_300d` embeddings, so the same `WordEmbeddingsModel` is required in the pipeline.
## Predicted Entities
`ACTIVIDAD`, `PROFESION`, `SITUACION_LABORAL`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_PROFESSIONS_ES/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_PROFESSIONS_ES.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/meddroprof_scielowiki_es_3.1.3_3.0_1627328955264.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/meddroprof_scielowiki_es_3.1.3_3.0_1627328955264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols("document") \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d", "es", "clinical/models")\
.setInputCols(["document", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("meddroprof_scielowiki", "es", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter])
sample_text = """La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO"""
df = spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(df).transform(df)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d", "es", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("meddroprof_scielowiki", "es", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter))
val data = Seq("""La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.med_ner.scielowiki").predict("""La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO""")
```
## Results
```bash
+---------------------------------------+-----------------+
|chunk |ner_label |
+---------------------------------------+-----------------+
|estudiando 1o ESO |SITUACION_LABORAL|
|ATS |PROFESION |
|trabajan en diferentes centros de salud|PROFESION |
|estudiando 1o ESO |SITUACION_LABORAL|
+---------------------------------------+-----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|meddroprof_scielowiki|
|Compatibility:|Healthcare NLP 3.1.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, word_embeddings]|
|Output Labels:|[ner]|
|Language:|es|
|Dependencies:|embeddings_scielowiki_300d|
## Data Source
The model was trained with the [MEDDOPROF](https://temu.bsc.es/meddoprof/data/) data set:
> The MEDDOPROF corpus is a collection of 1844 clinical cases from over 20 different specialties annotated with professions and employment statuses. The corpus was annotated by a team composed of linguists and clinical experts following specially prepared annotation guidelines, after several cycles of quality control and annotation consistency analysis before annotating the entire dataset. Figure 1 shows a screenshot of a sample manual annotation generated using the brat annotation tool.
Reference:
```
@article{meddoprof,
title={NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts},
author={Lima-López, Salvador and Farré-Maduell, Eulàlia and Miranda-Escalada, Antonio and Brivá-Iglesias, Vicent and Krallinger, Martin},
journal = {Procesamiento del Lenguaje Natural},
volume = {67},
year={2021}
}
```
## Benchmarking
```bash
label precision recall f1-score support
B-ACTIVIDAD 0.82 0.36 0.50 25
B-PROFESION 0.87 0.75 0.81 634
B-SITUACION_LABORAL 0.79 0.67 0.72 310
I-ACTIVIDAD 0.86 0.43 0.57 58
I-PROFESION 0.87 0.80 0.83 944
I-SITUACION_LABORAL 0.74 0.71 0.73 407
O 1.00 1.00 1.00 139880
accuracy - - 0.99 142258
macro-avg 0.85 0.67 0.74 142258
weighted-avg 0.99 0.99 0.99 142258
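The macro and weighted averages in the table follow the standard definitions: the macro average is the unweighted mean of the per-label scores, while the weighted average weighs each label by its support. A quick check against the precision column above (values as rounded in the table):

```python
# (precision, support) per label, copied from the benchmarking table above.
scores = {
    "B-ACTIVIDAD":         (0.82, 25),
    "B-PROFESION":         (0.87, 634),
    "B-SITUACION_LABORAL": (0.79, 310),
    "I-ACTIVIDAD":         (0.86, 58),
    "I-PROFESION":         (0.87, 944),
    "I-SITUACION_LABORAL": (0.74, 407),
    "O":                   (1.00, 139880),
}

# Macro average: plain mean over labels, ignoring support.
macro = sum(p for p, _ in scores.values()) / len(scores)
print(round(macro, 2))  # 0.85, matching the macro-avg row

# Weighted average: mean weighted by each label's support. Since the inputs
# above are themselves rounded, this lands near (not exactly at) the 0.99
# reported in the table.
total = sum(s for _, s in scores.values())
weighted = sum(p * s for p, s in scores.values()) / total
```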
```
---
layout: model
title: Spanish NER Pipeline
author: John Snow Labs
name: roberta_token_classifier_bne_capitel_ner_pipeline
date: 2022-04-20
tags: [roberta, token_classifier, spanish, ner, es, open_source]
task: Named Entity Recognition
language: es
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [roberta_token_classifier_bne_capitel_ner_es](https://nlp.johnsnowlabs.com/2021/12/07/roberta_token_classifier_bne_capitel_ner_es.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_bne_capitel_ner_pipeline_es_3.4.1_3.0_1650450203759.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_bne_capitel_ner_pipeline_es_3.4.1_3.0_1650450203759.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("roberta_token_classifier_bne_capitel_ner_pipeline", lang = "es")
pipeline.annotate("Me llamo Antonio y trabajo en la fábrica de Mercedes-Benz en Madrid.")
```
```scala
val pipeline = new PretrainedPipeline("roberta_token_classifier_bne_capitel_ner_pipeline", lang = "es")
pipeline.annotate("Me llamo Antonio y trabajo en la fábrica de Mercedes-Benz en Madrid.")
```
## Results
```bash
+------------------------+---------+
|chunk |ner_label|
+------------------------+---------+
|Antonio |PER |
|fábrica de Mercedes-Benz|ORG |
|Madrid |LOC |
+------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_token_classifier_bne_capitel_ner_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Language:|es|
|Size:|459.4 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- RoBertaForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: Fast Neural Machine Translation Model from Greek Languages to English
author: John Snow Labs
name: opus_mt_grk_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, grk, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `grk`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_grk_en_xx_2.7.0_2.4_1609166792734.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_grk_en_xx_2.7.0_2.4_1609166792734.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_grk_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_grk_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.grk.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_grk_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Utilities Clause Binary Classifier
author: John Snow Labs
name: legclf_utilities_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `utilities` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences instead of the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
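For illustration, the first option above (paragraph splitting by multiline) can be sketched in a few lines of plain Python, independently of the Legal NLP splitters; this is only the idea, not the tutorial's implementation:

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Split a document into paragraphs on one or more blank lines."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

# Hypothetical two-clause document used only for this sketch.
doc = "UTILITIES.\nThe Landlord shall pay all utility charges.\n\nNOTICES.\nAll notices shall be in writing."
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 2 clause candidates, one per paragraph
```

Each resulting paragraph can then be fed to the classifier as a separate row.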
Take into consideration that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `utilities`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_utilities_clause_en_1.0.0_3.2_1660123178356.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_utilities_clause_en_1.0.0_3.2_1660123178356.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
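This card is missing its usage snippet. Below is a minimal sketch modeled on the other legal clause classifiers in this series, which take `sentence_embeddings` as input (per the Model Information table). The embeddings stage shown here (`UniversalSentenceEncoder`) is an assumption, not confirmed by this card; check the companion `legclf_*` cards for the exact pipeline.

```python
# Sketch only: the embeddings stage is an assumption, not confirmed by this card.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed producer of the sentence_embeddings the classifier expects.
embeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = ClassifierDLModel.pretrained("legclf_utilities_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["The Landlord shall pay all utility charges."]]).toDF("text")
result = pipeline.fit(df).transform(df)
```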
## Results
```bash
+-----------+
|     result|
+-----------+
|[utilities]|
|    [other]|
|    [other]|
|[utilities]|
+-----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_utilities_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.99 0.98 0.99 105
utilities 0.93 0.97 0.95 29
accuracy - - 0.98 134
macro-avg 0.96 0.97 0.97 134
weighted-avg 0.98 0.98 0.98 134
```
---
layout: model
title: Word2Vec Embeddings in Pashto (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, ps, open_source]
task: Embeddings
language: ps
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ps_3.4.1_3.0_1647451184427.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ps_3.4.1_3.0_1647451184427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ps") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["زه د سپارک الاپ خوښوم"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ps")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("زه د سپارک الاپ خوښوم").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ps.embed.w2v_cc_300d").predict("""زه د سپارک الاپ خوښوم""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|ps|
|Size:|170.5 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: English XLMRobertaForTokenClassification Large Cased model (from asahi417)
author: John Snow Labs
name: xlmroberta_ner_tner_large_ontonotes5
date: 2022-08-13
tags: [en, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tner-xlm-roberta-large-ontonotes5` is an English model originally trained by `asahi417`.
## Predicted Entities
`language`, `time`, `percent`, `quantity`, `product`, `ordinal number`, `cardinal number`, `event`, `geopolitical area`, `facility`, `organization`, `work of art`, `group`, `money`, `law`, `person`, `location`, `date`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_large_ontonotes5_en_4.1.0_3.0_1660425115455.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_large_ontonotes5_en_4.1.0_3.0_1660425115455.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_large_ontonotes5","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_large_ontonotes5","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_tner_large_ontonotes5|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|1.8 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/asahi417/tner-xlm-roberta-large-ontonotes5
- https://github.com/asahi417/tner
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_roberta_base_squad2_24465517
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-roberta-base-squad2-24465517` is an English model originally trained by `teacookies`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465517_en_4.0.0_3.0_1655986325897.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465517_en_4.0.0_3.0_1655986325897.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465517","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465517","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.xlm_roberta.base_24465517.by_teacookies").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_autonlp_roberta_base_squad2_24465517|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|887.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/teacookies/autonlp-roberta-base-squad2-24465517
---
layout: model
title: Sentence Entity Resolver for UMLS CUI Codes
author: John Snow Labs
name: sbiobertresolve_umls_findings
date: 2021-10-03
tags: [entity_resolution, licensed, clinical, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.2.3
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities and concepts to 4 major categories of UMLS CUI codes using sbiobert_base_cased_mli Sentence Bert Embeddings. It loads about 6X faster than previous versions. The load process is also more memory-friendly: peak memory usage during loading is lower, reducing the chance of OOM exceptions and relaxing hardware requirements.
This model returns CUI (concept unique identifier) codes for 200K concepts from clinical findings. See https://www.nlm.nih.gov/research/umls/index.html for details.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_UMLS_CUI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_findings_en_3.2.3_3.0_1633220877215.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_findings_en_3.2.3_3.0_1633220877215.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
stopwords = StopWordsCleaner.pretrained()\
.setInputCols("token")\
.setOutputCol("cleanTokens")\
.setCaseSensitive(False)
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "cleanTokens"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "cleanTokens", "ner"]) \
.setOutputCol("ner_chunk")
chunk2doc = Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_umls_findings","en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver])
data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""]]).toDF("text")
results = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val stopwords = StopWordsCleaner.pretrained()
.setInputCols("token")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "cleanTokens"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "cleanTokens", "ner"))
.setOutputCol("ner_chunk")
val chunk2doc = new Chunk2Doc()
.setInputCols("ner_chunk")
.setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en","clinical/models")
.setInputCols("ner_chunk_doc")
.setOutputCol("sbert_embeddings")
val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_umls_findings", "en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val p_model = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver))
val data = Seq("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""").toDS().toDF("text")
val res = p_model.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.umls.findings").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""")
```
## Results
```bash
| | ner_chunk | cui_code |
|---:|:--------------------------------------|:-----------|
| 0 | gestational diabetes mellitus | C2183115 |
| 1 | subsequent type two diabetes mellitus | C3532488 |
| 2 | T2DM | C3280267 |
| 3 | HTG-induced pancreatitis | C4554179 |
| 4 | an acute hepatitis | C4750596 |
| 5 | obesity | C1963185 |
| 6 | a body mass index | C0578022 |
| 7 | polyuria | C3278312 |
| 8 | polydipsia | C3278316 |
| 9 | poor appetite | C0541799 |
| 10 | vomiting | C0042963 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_umls_findings|
|Compatibility:|Healthcare NLP 3.2.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_chunk_embeddings]|
|Output Labels:|[umls_code]|
|Language:|en|
|Case sensitive:|false|
## Data Source
Trained on 200K concepts from clinical findings. See https://www.nlm.nih.gov/research/umls/index.html for details.
---
layout: model
title: Legal Withholdings Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_withholdings_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, withholdings, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Withholdings` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences rather than the whole text, so it's better to skip it unless you want to do binary classification at the sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
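The first of these techniques, paragraph splitting by multiline, can be sketched in plain Python, independently of Spark NLP (a minimal illustration; the `split_paragraphs` helper is hypothetical):

```python
import re

def split_paragraphs(text: str) -> list[str]:
    # Split on runs of one or more blank lines and drop empty pieces.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = "Clause 1. Withholdings...\n\nClause 2. Notices...\n\n\nClause 3. Governing Law..."
print(split_paragraphs(doc))
# ['Clause 1. Withholdings...', 'Clause 2. Notices...', 'Clause 3. Governing Law...']
```

Each resulting paragraph can then be fed to the classifier as an independent document.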
Take into consideration that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Withholdings`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_withholdings_bert_en_1.0.0_3.0_1678049980404.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_withholdings_bert_en_1.0.0_3.0_1678049980404.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
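A minimal Python sketch of a typical Legal NLP classification pipeline for this model. The sentence-embeddings stage name (`sent_bert_base_cased`) and the sample clause text are illustrative assumptions; check this model's Models Hub entry for the exact embeddings it was trained with.

```python
# Assumes a Spark NLP for Legal session is already started, e.g.:
# from johnsnowlabs import nlp, legal
# spark = nlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence embeddings feeding the classifier's [sentence_embeddings] input
# (model name is an assumption for illustration).
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_withholdings_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

data = spark.createDataFrame(
    [["Any amounts payable hereunder shall be reduced by all applicable withholding taxes."]]
).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("class.result").show(truncate=False)
```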
## Results
```bash
+--------------+
|        result|
+--------------+
|[Withholdings]|
|       [Other]|
|       [Other]|
|[Withholdings]|
+--------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_withholdings_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 0.99 0.98 0.98 85
Withholdings 0.97 0.98 0.98 61
accuracy - - 0.98 146
macro-avg 0.98 0.98 0.98 146
weighted-avg 0.98 0.98 0.98 146
```
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from seongju)
author: John Snow Labs
name: xlm_roberta_qa_squadv2_xlm_roberta_base
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squadv2-xlm-roberta-base` is an English model originally trained by `seongju`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_squadv2_xlm_roberta_base_en_4.0.0_3.0_1655988029859.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_squadv2_xlm_roberta_base_en_4.0.0_3.0_1655988029859.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_squadv2_xlm_roberta_base","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_squadv2_xlm_roberta_base","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.xlm_roberta.base_v2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_squadv2_xlm_roberta_base|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|875.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/seongju/squadv2-xlm-roberta-base
- https://rajpurkar.github.io/SQuAD-explorer/
---
layout: model
title: English image_classifier_vit_denver_nyc_paris ViTForImageClassification from nateraw
author: John Snow Labs
name: image_classifier_vit_denver_nyc_paris
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_denver_nyc_paris` is an English model originally trained by nateraw.
## Predicted Entities
`denver`, `new york city`, `paris`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_denver_nyc_paris_en_4.1.0_3.0_1660172026182.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_denver_nyc_paris_en_4.1.0_3.0_1660172026182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_denver_nyc_paris", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_denver_nyc_paris", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_denver_nyc_paris|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Embeddings Scielo 50 dims
author: John Snow Labs
name: embeddings_scielo_50d
class: WordEmbeddingsModel
language: es
repository: clinical/models
date: 2020-05-26
task: Embeddings
edition: Healthcare NLP 2.5.0
spark_version: 2.4
tags: [clinical,embeddings,es]
supported: true
annotator: WordEmbeddingsModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_scielo_50d_es_2.5.0_2.4_1590467114993.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_scielo_50d_es_2.5.0_2.4_1590467114993.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
model = WordEmbeddingsModel.pretrained("embeddings_scielo_50d","es","clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("word_embeddings")
```
```scala
val model = WordEmbeddingsModel.pretrained("embeddings_scielo_50d","es","clinical/models")
.setInputCols("document","token")
.setOutputCol("word_embeddings")
```
{:.nlu-block}
```python
import nlu
nlu.load("es.embed.scielo.50d").predict("""Put your text here.""")
```
{:.h2_title}
## Results
Word2Vec feature vectors based on ``embeddings_scielo_50d``.
{:.model-param}
## Model Information
{:.table-model}
|---------------|-----------------------|
| Name: | embeddings_scielo_50d |
| Type: | WordEmbeddingsModel |
| Compatibility: | Spark NLP 2.5.0+ |
| License: | Licensed |
| Edition: | Official |
|Input labels: | [document, token] |
|Output labels: | [word_embeddings] |
| Language: | es |
| Dimension: | 50 |
{:.h2_title}
## Data Source
Trained on Scielo Articles
https://zenodo.org/record/3744326#.XtViinVKh_U
---
layout: model
title: Spanish RobertaForQuestionAnswering Base Cased model (from IIC)
author: John Snow Labs
name: roberta_qa_base_spanish_squades
date: 2022-12-02
tags: [es, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: es
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades` is a Spanish model originally trained by `IIC`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_es_4.2.4_3.0_1669986476235.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_es_4.2.4_3.0_1669986476235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades","es")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_spanish_squades|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|459.6 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/IIC/roberta-base-spanish-squades
- https://arxiv.org/abs/2107.07253
- https://paperswithcode.com/sota?task=question-answering&dataset=squad_es
---
layout: model
title: Detect Clinical Conditions (ner_eu_clinical_condition - es)
author: John Snow Labs
name: ner_eu_clinical_condition
date: 2023-02-06
tags: [es, clinical, licensed, ner, clinical_condition]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 4.2.8
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition (NER) deep learning model for extracting clinical conditions from Spanish texts. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN.
The corpus used for model training is provided by European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
## Predicted Entities
`clinical_condition`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_es_4.2.8_3.0_1675721390266.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_es_4.2.8_3.0_1675721390266.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","es")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained('ner_eu_clinical_condition', "es", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["""La exploración abdominal revela una cicatriz de laparotomía media infraumbilical, la presencia de ruidos disminuidos, y dolor a la palpación de manera difusa sin claros signos de irritación peritoneal. No existen hernias inguinales o crurales."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","es")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_eu_clinical_condition", "es", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter))
val data = Seq("""La exploración abdominal revela una cicatriz de laparotomía media infraumbilical, la presencia de ruidos disminuidos, y dolor a la palpación de manera difusa sin claros signos de irritación peritoneal. No existen hernias inguinales o crurales.""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+--------------------+------------------+
|chunk |ner_label |
+--------------------+------------------+
|cicatriz |clinical_condition|
|dolor a la palpación|clinical_condition|
|signos |clinical_condition|
|irritación |clinical_condition|
|hernias inguinales  |clinical_condition|
+--------------------+------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_eu_clinical_condition|
|Compatibility:|Healthcare NLP 4.2.8+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|898.1 KB|
## References
The corpus used for model training is provided by European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
## Benchmarking
```bash
label tp fp fn total precision recall f1
clinical_condition 354.0 42.0 84.0 438.0 0.8939 0.8082 0.8489
macro - - - - - - 0.8489
micro - - - - - - 0.8489
```
---
layout: model
title: NER Model Finder with Sentence Entity Resolvers (SBert, Medium, Uncased)
author: John Snow Labs
name: sbertresolve_ner_model_finder
date: 2022-01-17
tags: [ner, licensed, clinical, entity_resolver, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.3.2
spark_version: 2.4
supported: true
recommended: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities (NER labels) to the most appropriate NER model using `sbert_jsl_medium_uncased` Sentence Bert Embeddings. Given the entity name, it will return a list of pretrained NER models having that entity or similar ones.
## Predicted Entities
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_ner_model_finder_en_3.3.2_2.4_1642422477025.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_ner_model_finder_en_3.3.2_2.4_1642422477025.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbert_jsl_medium_uncased","en","clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("sbert_embeddings")
ner_model_finder = SentenceEntityResolverModel\
.pretrained("sbertresolve_ner_model_finder", "en", "clinical/models")\
.setInputCols(["ner_chunk", "sbert_embeddings"])\
.setOutputCol("model_names")\
.setDistanceFunction("EUCLIDEAN")
ner_model_finder_pipelineModel = PipelineModel(stages = [documentAssembler, sbert_embedder, ner_model_finder])
light_pipeline = LightPipeline(ner_model_finder_pipelineModel)
annotations = light_pipeline.fullAnnotate("medication")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbert_jsl_medium_uncased","en","clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("sbert_embeddings")
val ner_model_finder = SentenceEntityResolverModel
.pretrained("sbertresolve_ner_model_finder", "en", "clinical/models")
.setInputCols(Array("ner_chunk", "sbert_embeddings"))
.setOutputCol("model_names")
.setDistanceFunction("EUCLIDEAN")
val ner_model_finder_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, ner_model_finder))
val ner_model_finder_pipelineModel = ner_model_finder_pipeline.fit(Seq("").toDF("text"))
val light_pipeline = new LightPipeline(ner_model_finder_pipelineModel)
val annotations = light_pipeline.fullAnnotate("medication")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.ner.model_finder").predict("""Put your text here.""")
```
## Results
```bash
-RECORD 0--------------------------------------------------------------------------------------
 entity      | medication
 models      | ['ner_posology', 'ner_posology_large', 'ner_posology_small', 'ner_posology_greedy', 'ner_drugs_large', 'ner_posology_experimental', 'ner_drugs_greedy', 'ner_ade_clinical', 'ner_jsl_slim', 'ner_posology_healthcare', 'ner_ade_healthcare', 'jsl_ner_wip_modifier_clinical', 'ner_ade_clinical', 'ner_jsl_greedy', 'ner_risk_factors']
 all_models  | ['ner_posology', 'ner_posology_large', 'ner_posology_small', 'ner_posology_greedy', 'ner_drugs_large', 'ner_posology_experimental', 'ner_drugs_greedy', 'ner_ade_clinical', 'ner_jsl_slim', 'ner_posology_healthcare', 'ner_ade_healthcare', 'jsl_ner_wip_modifier_clinical', 'ner_ade_clinical', 'ner_jsl_greedy', 'ner_risk_factors']:::['ner_posology', 'ner_posology_large', 'ner_posology_small', 'ner_posology_greedy', 'ner_drugs_large', 'ner_posology_experimental', 'ner_drugs_greedy', 'ner_jsl_slim', 'ner_posology_healthcare', 'ner_ade_healthcare', 'jsl_ner_wip_modifier_clinical', 'ner_ade_clinical', 'ner_jsl_greedy', 'ner_risk_factors']:::['jsl_rd_ner_wip_greedy_clinical', 'ner_clinical_large', 'ner_healthcare', 'ner_jsl_enriched', 'ner_clinical', 'ner_jsl_slim', 'ner_covid_trials', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_events_admission_clinical', 'ner_events_healthcare', 'ner_events_clinical', 'ner_jsl_greedy']:::['ner_medmentions_coarse']:::['ner_jsl_enriched', 'ner_covid_trials', 'ner_jsl', 'ner_medmentions_coarse']:::['ner_drugs']:::['ner_clinical_icdem', 'ner_medmentions_coarse']:::['jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_medmentions_coarse', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy']:::['jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_medmentions_coarse', 'ner_radiology_wip_clinical', 'ner_jsl_slim', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy', 'ner_radiology']:::['ner_medmentions_coarse','ner_clinical_icdem']:::['ner_posology_experimental']:::['jsl_rd_ner_wip_greedy_clinical', 'ner_measurements_clinical', 'ner_radiology_wip_clinical', 'ner_radiology']:::['jsl_rd_ner_wip_greedy_clinical', 'ner_posology_small', 'ner_jsl_enriched', 'ner_posology_experimental', 'ner_posology_large', 'ner_posology_healthcare', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_posology_greedy', 'ner_posology', 'ner_jsl_greedy']:::['ner_covid_trials', 'ner_medmentions_coarse', 'jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy']:::['ner_deid_subentity_augmented', 'ner_deid_subentity_glove', 'ner_deidentify_dl', 'ner_deid_enriched']:::['jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_covid_trials', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy']:::['ner_medmentions_coarse', 'jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy']:::['ner_chemd_clinical']
 resolutions | medication:::drug:::treatment:::therapeutic procedure:::drug ingredient:::drug chemical:::diagnostic aid:::substance:::medical device:::diagnostic procedure:::administration:::measurement:::drug strength:::physiological reaction:::patient:::vaccine:::psychological condition:::abbreviation
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbertresolve_ner_model_finder|
|Compatibility:|Healthcare NLP 3.3.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sbert_embeddings]|
|Output Labels:|[models]|
|Language:|en|
|Size:|611.1 KB|
|Case sensitive:|false|
---
layout: model
title: Fast Neural Machine Translation Model from Afrikaans to Swedish
author: John Snow Labs
name: opus_mt_af_sv
date: 2021-06-01
tags: [open_source, seq2seq, translation, af, sv, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `af`
- target languages: `sv`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_af_sv_xx_3.1.0_2.4_1622562929786.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_af_sv_xx_3.1.0_2.4_1622562929786.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_af_sv", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_af_sv", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Afrikaans.translate_to.Swedish').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_af_sv|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune TFWav2Vec2ForCTC from hrdipto
author: John Snow Labs
name: pipeline_asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune` is an English model originally trained by hrdipto.
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune_en_4.2.0_3.0_1664041738598.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune_en_4.2.0_3.0_1664041738598.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Translate English to Dutch Pipeline
author: John Snow Labs
name: translate_en_nl
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, nl, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this module is computationally expensive, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `nl`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_nl_xx_2.7.0_2.4_1609688238018.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_nl_xx_2.7.0_2.4_1609688238018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_nl", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_nl", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.nl').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_nl|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Named Entity Recognition - BERT Base (OntoNotes)
author: John Snow Labs
name: onto_bert_base_cased
date: 2020-12-05
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [ner, en, open_source]
supported: true
annotator: NerDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Onto is a Named Entity Recognition (or NER) model trained on OntoNotes 5.0. It can extract up to 18 entities such as people, places, organizations, money, time, date, etc.
This model uses the pretrained `bert_base_cased` embeddings model from `BertEmbeddings` annotator as an input.
## Predicted Entities
`CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_bert_base_cased_en_2.7.0_2.4_1607197077494.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_bert_base_cased_en_2.7.0_2.4_1607197077494.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("bert_base_cased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
ner_onto = NerDLModel.pretrained("onto_bert_base_cased", "en") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text'))
result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("bert_base_cased", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_onto = NerDLModel.pretrained("onto_bert_base_cased", "en")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter))
val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""]
ner_df = nlu.load('en.ner.onto.bert.cased_base').predict(text, output_level='chunk')
ner_df[["entities", "entities_class"]]
```
{:.h2_title}
## Results
```bash
+-----------------------+---------+
|chunk |ner_label|
+-----------------------+---------+
|William Henry Gates III|PERSON |
|October 28, 1955 |DATE |
|American |NORP |
|Microsoft Corporation |ORG |
|Microsoft |ORG |
|Gates |PERSON |
|May 2014 |DATE |
|one |CARDINAL |
|the 1970s and 1980s |DATE |
|Seattle |GPE |
|Washington |GPE |
|Gates |PERSON |
|Paul Allen |PERSON |
|1975 |DATE |
|Albuquerque |GPE |
|New Mexico |GPE |
|Gates |ORG |
|January 2000 |DATE |
|the late 1990s |DATE |
|Gates |PERSON |
+-----------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|onto_bert_base_cased|
|Type:|ner|
|Compatibility:|Spark NLP 2.7.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Data Source
The model was trained on data from [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19)
## Benchmarking
```bash
Micro-average:
prec: 0.8987879, rec: 0.90063596, f1: 0.89971095
CoNLL Eval:
processed 152728 tokens with 11257 phrases; found: 11276 phrases; correct: 10006.
accuracy: 98.01%; 10006 11257 11276 precision: 88.74%; recall: 88.89%; FB1: 88.81
CARDINAL: 822 935 990 precision: 83.03%; recall: 87.91%; FB1: 85.40 990
DATE: 1355 1602 1567 precision: 86.47%; recall: 84.58%; FB1: 85.52 1567
EVENT: 32 63 59 precision: 54.24%; recall: 50.79%; FB1: 52.46 59
FAC: 96 135 124 precision: 77.42%; recall: 71.11%; FB1: 74.13 124
GPE: 2116 2240 2182 precision: 96.98%; recall: 94.46%; FB1: 95.70 2182
LANGUAGE: 10 22 11 precision: 90.91%; recall: 45.45%; FB1: 60.61 11
LAW: 21 40 28 precision: 75.00%; recall: 52.50%; FB1: 61.76 28
LOC: 141 179 178 precision: 79.21%; recall: 78.77%; FB1: 78.99 178
MONEY: 278 314 321 precision: 86.60%; recall: 88.54%; FB1: 87.56 321
NORP: 799 841 850 precision: 94.00%; recall: 95.01%; FB1: 94.50 850
ORDINAL: 177 195 217 precision: 81.57%; recall: 90.77%; FB1: 85.92 217
ORG: 1606 1795 1848 precision: 86.90%; recall: 89.47%; FB1: 88.17 1848
PERCENT: 306 349 344 precision: 88.95%; recall: 87.68%; FB1: 88.31 344
PERSON: 1856 1988 1978 precision: 93.83%; recall: 93.36%; FB1: 93.60 1978
PRODUCT: 54 76 76 precision: 71.05%; recall: 71.05%; FB1: 71.05 76
QUANTITY: 87 105 108 precision: 80.56%; recall: 82.86%; FB1: 81.69 108
TIME: 143 212 216 precision: 66.20%; recall: 67.45%; FB1: 66.82 216
WORK_OF_ART: 107 166 179 precision: 59.78%; recall: 64.46%; FB1: 62.03 179
```
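The FB1 columns in this report are ordinary F1 scores, i.e. the harmonic mean of precision and recall; the reported micro-average can be checked directly:

```python
# Recompute the micro-average F1 from the reported precision and recall.
prec, rec = 0.8987879, 0.90063596
f1 = 2 * prec * rec / (prec + rec)
print(f1)  # matches the reported f1 of 0.89971095 to ~7 decimal places
```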
---
layout: model
title: Traditional Chinese Word Segmentation
author: John Snow Labs
name: wordseg_gsd_ud_trad
date: 2021-01-25
task: Word Segmentation
language: zh
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [word_segmentation, zh, open_source]
supported: true
annotator: WordSegmenterModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
WordSegmenterModel (WSM) is based on a maximum entropy probability model that detects word boundaries in Chinese text. Chinese text is written without white space between words, so a computer-based application cannot know a priori which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step. This model was trained on *traditional characters* in Chinese texts.
Reference:
- Xue, Nianwen. “Chinese word segmentation as character tagging.” International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing. 2003.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_gsd_ud_trad_zh_2.7.0_2.4_1611584735643.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_gsd_ud_trad_zh_2.7.0_2.4_1611584735643.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline as a substitute for the Tokenizer stage.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud_trad", "zh")\
.setInputCols(["sentence"])\
.setOutputCol("token")
pipeline = Pipeline(stages=[document_assembler,sentence_detector, word_segmenter])
ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
example = spark.createDataFrame([['然而,這樣的處理也衍生了一些問題。']], ["text"])
result = ws_model.transform(example)
```
```scala
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud_trad", "zh")
.setInputCols(Array("sentence"))
.setOutputCol("token")
val pipeline = new Pipeline().setStages(Array(document_assembler,sentence_detector, word_segmenter))
val data = Seq("然而,這樣的處理也衍生了一些問題。").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""然而,這樣的處理也衍生了一些問題。"""]
token_df = nlu.load('zh.segment_words.gsd').predict(text)
token_df
```
## Results
```bash
+-----------------------------------------+-----------------------------------------------------------+
|text | result |
+-----------------------------------------+-----------------------------------------------------------+
|然而 , 這樣 的 處理 也 衍生 了 一些 問題 。 |[然而, ,, 這樣, 的, 處理, 也, 衍生, 了, 一些, 問題, 。] |
+-----------------------------------------+-----------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|wordseg_gsd_ud_trad|
|Compatibility:|Spark NLP 2.7.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[token]|
|Language:|zh|
## Data Source
The model was trained on the [Universal Dependencies](https://universaldependencies.org/) for Traditional Chinese annotated and converted by Google.
## Benchmarking
```bash
| precision | recall | f1-score |
|--------------|----------|------------|
| 0.7392 | 0.7754 | 0.7569 |
```
---
layout: model
title: Legal Distributions Clause Binary Classifier
author: John Snow Labs
name: legclf_distributions_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True/False) for the `distributions` clause type. To use it, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only individual sentences rather than the whole text, so it's better to skip them unless you want to do binary classification at sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
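The splitting-plus-length-check workflow above can be sketched in plain Python (a hypothetical standalone helper, not a Spark NLP component; the 512 figure comes from the embedding limit mentioned above, and the token count here is a rough whitespace approximation):

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a document on blank lines and flag chunks whose rough
    whitespace token count stays within the embedding budget."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = "1. DISTRIBUTIONS. The Partnership shall distribute...\n\n2. NOTICES. All notices shall be in writing..."
chunks = split_paragraphs(doc)
for paragraph, fits in chunks:
    print(fits, paragraph[:30])
```

Chunks flagged `False` should be split further (for example by headers or subheaders) before being fed to the classifier.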
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `distributions`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_distributions_clause_en_1.0.0_3.2_1660123426386.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_distributions_clause_en_1.0.0_3.2_1660123426386.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
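This card ships without a usage snippet. A minimal sketch of how such a clause classifier is typically assembled follows; note that the `sent_bert_base_cased` embeddings stage and the `legal.ClassifierDLModel` loader are assumptions based on comparable legal classifiers, not details stated by this card:

```python
# Sketch only: assumes Spark NLP for Legal is installed, a Spark session
# is running, and the upstream embeddings model is sent_bert_base_cased
# (an assumption -- this card only specifies a `sentence_embeddings` input).
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_distributions_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```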
## Results
```bash
+---------------+
|         result|
+---------------+
|[distributions]|
|        [other]|
|        [other]|
|[distributions]|
+---------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_distributions_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.2 MB|
## References
Legal documents, scraped from the Internet, and classified in-house + SEC documents
## Benchmarking
```bash
label precision recall f1-score support
distributions 1.00 1.00 1.00 54
other 1.00 1.00 1.00 101
accuracy - - 1.00 155
macro-avg 1.00 1.00 1.00 155
weighted-avg 1.00 1.00 1.00 155
```
---
layout: model
title: Clinical Deidentification (English, Glove, Augmented)
author: John Snow Labs
name: clinical_deidentification_glove_augmented
date: 2022-03-22
tags: [deid, deidentification, en, licensed]
task: De-identification
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
recommended: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline is trained with lightweight glove_100d embeddings and can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR` entities.
It differs from `clinical_deidentification_glove` in how it handles PHONE and PATIENT entities: besides the NER models, it includes rules in Contextual Parser components.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_augmented_en_3.4.1_3.0_1647966639326.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_augmented_en_3.4.1_3.0_1647966639326.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("clinical_deidentification_glove_augmented", "en", "clinical/models")
deid_pipeline.annotate("Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val deid_pipeline = PretrainedPipeline("clinical_deidentification_glove_augmented", "en", "clinical/models")
val result = deid_pipeline.annotate("Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.deid.glove_augmented.pipeline").predict("""Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.""")
```
## Results
```bash
{'sentence': ['Record date : 2093-01-13, David Hale, M.D.',
'IP: 203.120.223.13.',
'The driver's license no:A334455B.',
'the SSN:324598674 and e-mail: hale@gmail.com.',
'Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93.',
'PCP : Oliveira, 25 years-old.',
'Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.'],
'masked': ['Record date : , , M.D.',
'IP: .',
'The driver's license .',
'the and e-mail: .',
'Name : MR. # Date : .',
'PCP : , years-old.',
'Record date : , Patient's VIN : .'],
'obfuscated': ['Record date : 2093-02-13, Shella Solan, M.D.',
'IP: 444.444.444.444.',
'The driver's license O497302436569.',
'the SSN-539-29-1060 and e-mail: Keith@google.com.',
'Name : Roscoe Kerns MR. # Q984288 Date : 10-08-1991.',
'PCP : Dr Rudell Dubin, 10 years-old.',
'Record date : 2079-12-30, Patient's VIN : 5eeee44ffff555666.'],
'ner_chunk': ['2093-01-13',
'David Hale',
'no:A334455B',
'SSN:324598674',
'Hendrickson, Ora',
'719435',
'01/13/93',
'Oliveira',
'25',
'2079-11-09',
'1HGBH41JXMN109286']}
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|clinical_deidentification_glove_augmented|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|181.3 MB|
## Included Models
- nlp.DocumentAssembler
- nlp.SentenceDetector
- nlp.TokenizerModel
- nlp.WordEmbeddingsModel
- medical.NerModel
- medical.NerConverterInternal
- medical.NerModel
- medical.NerConverterInternal
- ChunkMergeModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ChunkMergeModel
- ChunkMergeModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- Finisher
---
layout: model
title: English BertForQuestionAnswering model (from madlag)
author: John Snow Labs
name: bert_qa_bert_base_uncased_squadv1_x1.16_f88.1_d8_unstruct_v1
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squadv1-x1.16-f88.1-d8-unstruct-v1` is an English model originally trained by `madlag`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1_x1.16_f88.1_d8_unstruct_v1_en_4.0.0_3.0_1654181545377.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1_x1.16_f88.1_d8_unstruct_v1_en_4.0.0_3.0_1654181545377.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squadv1_x1.16_f88.1_d8_unstruct_v1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_uncased_squadv1_x1.16_f88.1_d8_unstruct_v1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base_uncased.x1.16_f88.1_d8_unstruct.by_madlag").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_uncased_squadv1_x1.16_f88.1_d8_unstruct_v1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|146.1 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/madlag/bert-base-uncased-squadv1-x1.16-f88.1-d8-unstruct-v1
- https://rajpurkar.github.io/SQuAD-explorer
- https://www.aclweb.org/anthology/N19-1423.pdf
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_el8_dl2
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el8-dl2` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el8_dl2_en_4.3.0_3.0_1675120612912.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el8_dl2_en_4.3.0_3.0_1675120612912.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_small_el8_dl2","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_small_el8_dl2","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_el8_dl2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|127.7 MB|
## References
- https://huggingface.co/google/t5-efficient-small-el8-dl2
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English T5ForConditionalGeneration Cased model (from ybagoury)
author: John Snow Labs
name: t5_flan_base_tldr_news
date: 2023-03-02
tags: [open_source, t5, flan, en, tensorflow]
task: Text Generation
language: en
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `flan-t5-base-tldr_news` is an English model originally trained by `ybagoury`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_flan_base_tldr_news_en_4.3.0_3.0_1677760144575.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_flan_base_tldr_news_en_4.3.0_3.0_1677760144575.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_flan_base_tldr_news","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_flan_base_tldr_news","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_flan_base_tldr_news|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|1.0 GB|
## References
https://huggingface.co/ybagoury/flan-t5-base-tldr_news
---
layout: model
title: German asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673 TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673` is a German model originally trained by jonatasgrosman.
NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use `asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673_de_4.2.0_3.0_1664114173921.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673_de_4.2.0_3.0_1664114173921.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673", "de")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673", "de")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|de|
|Size:|1.2 GB|
---
layout: model
title: Part of Speech for Amharic (pos_ud_att)
author: John Snow Labs
name: pos_ud_att
date: 2021-01-20
task: Part of Speech Tagging
language: am
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [am, pos, open_source]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 13 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
## Predicted Entities
| POS tag | Description |
|---------|----------------------------|
| ADJ | adjective |
| ADP | adposition |
| ADV | adverb |
| AUX | auxiliary |
| CCONJ | coordinating conjunction |
| DET | determiner |
| INTJ | interjection |
| NOUN | noun |
| NUM | numeral |
| PART | particle |
| PRON | pronoun |
| PROPN | proper noun |
| PUNCT | punctuation |
| SCONJ | subordinating conjunction |
| SYM | symbol |
| VERB | verb |
| X | other |
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_att_am_2.7.0_2.4_1611180723328.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_att_am_2.7.0_2.4_1611180723328.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ud_att", "am") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_att", "am")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።"]
pos_df = nlu.load('am.pos').predict(text)
pos_df
```
## Results
```bash
+------------------------------+----------------------------------------------------------------+
|text |result |
+------------------------------+----------------------------------------------------------------+
|ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።|[NOUN, DET, PART, NOUN, DET, PART, VERB, PRON, AUX, PRON, PUNCT]|
+------------------------------+----------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_att|
|Compatibility:|Spark NLP 2.7.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[pos]|
|Language:|am|
## Data Source
The model was trained on the [Universal Dependencies](https://universaldependencies.org/) version 2.7.
Reference:
- Binyam Ephrem Seyoum, Yusuke Miyao and Baye Yimam Mekonnen. 2018. Universal Dependencies for Amharic. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pp. 2216–2222, Miyazaki, Japan: European Language Resources Association (ELRA)
## Benchmarking
```bash
| | precision | recall | f1-score | support |
|:------------:|:---------:|:------:|:--------:|:-------:|
| ADJ | 1.00 | 0.97 | 0.99 | 116 |
| ADP | 0.99 | 1.00 | 0.99 | 681 |
| ADV | 0.94 | 0.99 | 0.96 | 93 |
| AUX | 1.00 | 1.00 | 1.00 | 419 |
| CCONJ | 0.99 | 0.97 | 0.98 | 99 |
| DET | 0.99 | 1.00 | 0.99 | 485 |
| INTJ | 0.97 | 0.99 | 0.98 | 67 |
| NOUN | 0.99 | 1.00 | 1.00 | 1485 |
| NUM | 1.00 | 1.00 | 1.00 | 42 |
| PART | 1.00 | 1.00 | 1.00 | 875 |
| PRON | 1.00 | 1.00 | 1.00 | 2547 |
| PROPN | 1.00 | 0.99 | 0.99 | 236 |
| PUNCT | 1.00 | 1.00 | 1.00 | 1093 |
| SCONJ | 1.00 | 0.98 | 0.99 | 214 |
| VERB | 1.00 | 1.00 | 1.00 | 1552 |
| accuracy | | | 1.00 | 10004 |
| macro avg | 0.99 | 0.99 | 0.99 | 10004 |
| weighted avg | 1.00 | 1.00 | 1.00 | 10004 |
```
---
layout: model
title: Legal Approvals Clause Binary Classifier
author: John Snow Labs
name: legclf_approvals_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True/False) for the `approvals` clause type. To use it, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only individual sentences rather than the whole text, so it's better to skip them unless you want to do binary classification at sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `approvals`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_approvals_clause_en_1.0.0_3.2_1660123231676.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_approvals_clause_en_1.0.0_3.2_1660123231676.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
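This card ships without a usage snippet. A minimal sketch of how such a clause classifier is typically assembled follows; note that the `sent_bert_base_cased` embeddings stage and the `legal.ClassifierDLModel` loader are assumptions based on comparable legal classifiers, not details stated by this card:

```python
# Sketch only: assumes Spark NLP for Legal is installed, a Spark session
# is running, and the upstream embeddings model is sent_bert_base_cased
# (an assumption -- this card only specifies a `sentence_embeddings` input).
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_approvals_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```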
## Results
```bash
+-----------+
|     result|
+-----------+
|[approvals]|
|    [other]|
|    [other]|
|[approvals]|
+-----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_approvals_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
approvals 0.89 0.94 0.91 33
other 0.97 0.95 0.96 82
accuracy - - 0.95 115
macro-avg 0.93 0.95 0.94 115
weighted-avg 0.95 0.95 0.95 115
```
---
layout: model
title: Ocr pipeline in streaming
author: John Snow Labs
name: ocr_streaming
date: 2023-01-03
tags: [en, licensed, ocr, streaming]
task: Ocr Streaming
language: en
nav_key: models
edition: Visual NLP 4.0.0
spark_version: 3.0
supported: true
annotator: OcrStreaming
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Streaming pipeline implementation for the OCR task, using Tesseract models. Tesseract is an Optical Character Recognition (OCR) engine developed by Google. It is an open-source tool that can be used to recognize text in images and convert it into machine-readable text. The engine is based on a neural network architecture and uses machine learning algorithms to improve its accuracy over time.
Tesseract has been trained on a variety of datasets to improve its recognition capabilities. These datasets include images of text in various languages and scripts, as well as images with different font styles, sizes, and orientations. The training process involves feeding the engine with a large number of images and their corresponding text, allowing the engine to learn the patterns and characteristics of different text styles. One of the most important datasets used in training Tesseract is the UNLV dataset, which contains over 400,000 images of text in different languages, scripts, and font styles. This dataset is widely used in the OCR community and has been instrumental in improving the accuracy of Tesseract. Other datasets that have been used in training Tesseract include the ICDAR dataset, the IIIT-HWS dataset, and the RRC-GV-WS dataset.
In addition to these datasets, Tesseract also uses a technique called adaptive training, where the engine can continuously improve its recognition capabilities by learning from new images and text. This allows Tesseract to adapt to new text styles and languages, and improve its overall accuracy.
## Predicted Entities
{:.btn-box}
[Open in Colab](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/tutorials/Certification_Trainings/6.1.SparkOcrStreamingPDF.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
# Transform binary to image
pdf_to_image = PdfToImage() \
.setInputCol("content") \
.setOutputCol("image")
# Run OCR for each region
ocr = ImageToText() \
.setInputCol("image") \
.setOutputCol("text") \
.setConfidenceThreshold(60)
# OCR pipeline
pipeline = PipelineModel(stages=[
pdf_to_image,
ocr])
# path to the folder with PDFs
dataset_path = "data/pdfs/*.pdf"
# read one file to infer the schema
pdfs_df = spark.read.format("binaryFile").load(dataset_path).limit(1)
# count of files in one microbatch
maxFilesPerTrigger = 4
# read files as stream
pdf_stream_df = spark.readStream \
.format("binaryFile") \
.schema(pdfs_df.schema) \
.option("maxFilesPerTrigger", maxFilesPerTrigger) \
.load(dataset_path)
# process files using OCR pipeline
result = pipeline.transform(pdf_stream_df).withColumn("timestamp", current_timestamp())
# store results to memory table
query = result.writeStream \
.format('memory') \
.queryName('result') \
.start()
# show results
spark.table("result").select("timestamp","pagenum", "path", "text").show(10)
```
```scala
// Transform binary to image
val pdf_to_image = new PdfToImage()
.setInputCol("content")
.setOutputCol("image")
// Run OCR for each region
val ocr = new ImageToText()
.setInputCol("image")
.setOutputCol("text")
.setConfidenceThreshold(60)
// OCR pipeline (all stages are transformers, so fitting is a no-op)
val pipeline = new Pipeline().setStages(Array(pdf_to_image, ocr))
// path to the folder with PDFs
val dataset_path = "data/pdfs/*.pdf"
// read one file to infer the schema
val pdfs_df = spark.read.format("binaryFile").load(dataset_path).limit(1)
// number of files in one microbatch
val maxFilesPerTrigger = 4
// read files as a stream
val pdf_stream_df = spark.readStream
.format("binaryFile")
.schema(pdfs_df.schema)
.option("maxFilesPerTrigger", maxFilesPerTrigger)
.load(dataset_path)
// process files using the OCR pipeline
val result = pipeline.fit(pdfs_df).transform(pdf_stream_df).withColumn("timestamp", current_timestamp())
// store results to an in-memory table
val query = result.writeStream
.format("memory")
.queryName("result")
.start()
// show results
spark.table("result").select("timestamp", "pagenum", "path", "text").show(10)
```
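The `maxFilesPerTrigger` option above caps how many files enter each microbatch. Its effect can be sketched in plain Python (an illustrative sketch of the batching behavior, not Spark's actual scheduling code):

```python
def microbatches(files, max_files_per_trigger):
    """Yield successive batches of at most max_files_per_trigger files."""
    for i in range(0, len(files), max_files_per_trigger):
        yield files[i:i + max_files_per_trigger]

pdfs = [f"doc_{i}.pdf" for i in range(10)]
batches = list(microbatches(pdfs, 4))
# 10 files with maxFilesPerTrigger=4 -> triggers of 4, 4 and 2 files
```

With ten PDFs and `maxFilesPerTrigger = 4`, the stream processes them in three triggers of 4, 4, and 2 files.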
## Example
### Input:

### Output
```bash
+--------------------+
| value|
+--------------------+
| |
| |
| |
| |
| |
| |
|ne Pa a Date: 7/1...|
|er ‘Sample No. _ ...|
|“ Original reques...|
| |
|Sample specificat...|
| , BLEND CASING R...|
| |
|- OLD GOLD STRAIG...|
| |
|Control for Sampl...|
| |
| Cigarettes:|
| |
| OLD GOLD STRAIGHT|
+--------------------+
only showing top 20 rows
```
## Model Information
{:.table-model}
|---|---|
|Model Name:|ocr_streaming|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
---
layout: model
title: Legal Indemnification Procedures Clause Binary Classifier
author: John Snow Labs
name: legclf_indemnification_procedures_clause
date: 2023-01-27
tags: [en, legal, classification, indemnification, procedures, clauses, indemnification_procedures, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `indemnification-procedures` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip them unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, getting as output a series of True/False values, one for each legal clause model you have added.
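Paragraph splitting by multiline can be as simple as splitting on blank lines before feeding each piece to the classifier (a minimal plain-Python sketch; the workshop tutorial linked above covers more robust approaches):

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on blank lines, dropping empties."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause 1. Indemnification...\n\nClause 2. Governing Law...\n\n\nClause 3. Notices..."
paragraphs = split_paragraphs(doc)
# -> three paragraphs, one per clause, each short enough for the 512-token limit
```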
## Predicted Entities
`indemnification-procedures`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_procedures_clause_en_1.0.0_3.0_1674819938050.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_procedures_clause_en_1.0.0_3.0_1674819938050.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------+
|result|
+-------+
|[indemnification-procedures]|
|[other]|
|[other]|
|[indemnification-procedures]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_indemnification_procedures_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
indemnification-procedures 1.00 0.96 0.98 23
other 0.97 1.00 0.99 39
accuracy - - 0.98 62
macro-avg 0.99 0.98 0.98 62
weighted-avg 0.98 0.98 0.98 62
```
---
layout: model
title: Detect Problems, Tests and Treatments (ner_clinical) in German
author: John Snow Labs
name: ner_clinical
date: 2023-05-08
tags: [ner, clinical, licensed, de]
task: Named Entity Recognition
language: de
edition: Healthcare NLP 4.4.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for clinical terms in German. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
## Predicted Entities
`PROBLEM`, `TEST`, `TREATMENT`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/14.German_Healthcare_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_de_4.4.0_3.0_1683555292486.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_de_4.4.0_3.0_1683555292486.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "de", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
sample_text= """Verschlechterung von Schmerzen oder Schwäche in den Beinen , Verlust der Darm - oder Blasenfunktion oder andere besorgniserregende Symptome.
Der Patient erhielt empirisch Ampicillin , Gentamycin und Flagyl sowie Narcan zur Umkehrung von Fentanyl .
ALT war 181 , AST war 156 , LDH war 336 , alkalische Phosphatase war 214 und Bilirubin war insgesamt 12,7 ."""
results = model.transform(spark.createDataFrame([[sample_text]], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_clinical", "de", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))
val data = Seq("""Verschlechterung von Schmerzen oder Schwäche in den Beinen , Verlust der Darm - oder Blasenfunktion oder andere besorgniserregende Symptome.
Der Patient erhielt empirisch Ampicillin , Gentamycin und Flagyl sowie Narcan zur Umkehrung von Fentanyl .
ALT war 181 , AST war 156 , LDH war 336 , alkalische Phosphatase war 214 und Bilirubin war insgesamt 12,7 .""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+----------------------+---------+
|chunk |ner_label|
+----------------------+---------+
|Schmerzen |PROBLEM |
|Schwäche in den Beinen|PROBLEM |
|Verlust der Darm |PROBLEM |
|Blasenfunktion |PROBLEM |
|Symptome |PROBLEM |
|empirisch Ampicillin |TREATMENT|
|Gentamycin |TREATMENT|
|Flagyl |TREATMENT|
|Narcan |TREATMENT|
|Fentanyl |TREATMENT|
|ALT |TEST |
|AST |TEST |
|LDH |TEST |
|alkalische Phosphatase|TEST |
|Bilirubin |TEST |
+----------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_clinical|
|Compatibility:|Healthcare NLP 4.4.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|2.0 MB|
## Benchmarking
```bash
label precision recall f1-score support
B-PROBLEM 0.85 0.71 0.78 512
B-TEST 0.89 0.85 0.87 203
B-TREATMENT 0.84 0.82 0.83 238
I-PROBLEM 0.78 0.70 0.74 355
I-TEST 0.90 0.83 0.87 66
I-TREATMENT 0.62 0.71 0.66 75
O 0.94 0.97 0.95 4141
accuracy - - 0.91 5590
macro avg 0.83 0.80 0.81 5590
weighted avg 0.91 0.91 0.91 5590
```
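The weighted-avg row above is the support-weighted mean of the per-label scores. A quick plain-Python check of the reported weighted F1, with values copied from the table:

```python
# per-label (f1-score, support) pairs from the benchmarking table
rows = {
    "B-PROBLEM": (0.78, 512), "B-TEST": (0.87, 203), "B-TREATMENT": (0.83, 238),
    "I-PROBLEM": (0.74, 355), "I-TEST": (0.87, 66), "I-TREATMENT": (0.66, 75),
    "O": (0.95, 4141),
}
total = sum(support for _, support in rows.values())
weighted_f1 = sum(f1 * support for f1, support in rows.values()) / total
# rounds to the 0.91 reported in the weighted-avg row
```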
---
layout: model
title: English DistilBertForTokenClassification Cased model (from f2io)
author: John Snow Labs
name: distilbert_token_classifier_ner_roles_openapi
date: 2023-03-14
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ner-roles-openapi` is an English model originally trained by `f2io`.
## Predicted Entities
``, `MISC`, `ORG`, `ENTITY`, `PER`, `PRG`, `ROLE`, `OR`, `LOC`, `ACTION`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_ner_roles_openapi_en_4.3.1_3.0_1678782949346.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_ner_roles_openapi_en_4.3.1_3.0_1678782949346.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_ner_roles_openapi","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_ner_roles_openapi","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
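The pipeline above emits one BIO-style tag per token in the `ner` column. To assemble tagged tokens into entity chunks you would typically append a `NerConverter` stage; conceptually, the conversion works like this plain-Python sketch (illustrative only, not the library's implementation):

```python
def bio_to_chunks(tokens, tags):
    """Collapse token-level BIO tags into (text, label) entity chunks."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["John", "Smith", "works", "at", "Acme"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG"]
chunks = bio_to_chunks(tokens, tags)
# -> [("John Smith", "PER"), ("Acme", "ORG")]
```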
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_ner_roles_openapi|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/f2io/ner-roles-openapi
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from deepset)
author: John Snow Labs
name: roberta_qa_deepset_base_squad2
date: 2022-12-02
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
recommended: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2` is an English model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_en_4.2.4_3.0_1669986722225.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_en_4.2.4_3.0_1669986722225.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
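Extractive QA models of this kind score every token of the context as a candidate answer start and end, and the annotator returns the best-scoring span. A simplified plain-Python sketch of span selection (illustrative; not Spark NLP's actual decoding code, and the scores below are made up):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) pair maximizing start+end score, with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            if ss + end_scores[e] > best_score:
                best, best_score = (s, e), ss + end_scores[e]
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
end   = [0.0, 0.1, 0.0, 4.5, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0]
s, e = best_span(start, end)
answer = " ".join(tokens[s:e + 1])
# -> "Clara"
```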
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_deepset_base_squad2|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|464.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/deepset/roberta-base-squad2
- https://haystack.deepset.ai/tutorials/first-qa-system
- https://github.com/deepset-ai/haystack/
- https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
- http://deepset.ai/
- https://haystack.deepset.ai/
- https://deepset.ai/german-bert
- https://deepset.ai/germanquad
- https://github.com/deepset-ai/haystack
- https://docs.haystack.deepset.ai
- https://haystack.deepset.ai/community
- https://twitter.com/deepset_ai
- https://www.linkedin.com/company/deepset-ai/
- https://github.com/deepset-ai/haystack/discussions
- https://deepset.ai
- http://www.deepset.ai/jobs
- https://paperswithcode.com/sota?task=Question+Answering&dataset=squad_v2
---
layout: model
title: Fast Neural Machine Translation Model from Catalan to English
author: John Snow Labs
name: opus_mt_ca_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, ca, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `ca`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ca_en_xx_2.7.0_2.4_1609169355424.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ca_en_xx_2.7.0_2.4_1609169355424.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_ca_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_ca_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.ca.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_ca_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Pipeline to Detect PHI for Deidentification (Glove, 7 labels)
author: John Snow Labs
name: ner_deid_generic_glove_pipeline
date: 2023-03-13
tags: [deid, clinical, glove, licensed, ner, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_deid_generic_glove](https://nlp.johnsnowlabs.com/2021/06/06/ner_deid_generic_glove_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_glove_pipeline_en_4.3.0_3.2_1678734514341.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_glove_pipeline_en_4.3.0_3.2_1678734514341.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_deid_generic_glove_pipeline", "en", "clinical/models")
text = '''Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_deid_generic_glove_pipeline", "en", "clinical/models")
val text = "Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:------------------------------|--------:|------:|:------------|-------------:|
| 0 | 2093-01-13 | 14 | 23 | DATE | 1 |
| 1 | David Hale | 27 | 36 | NAME | 0.9938 |
| 2 | Hendrickson Ora | 55 | 69 | NAME | 0.992 |
| 3 | 7194334 | 78 | 84 | ID | 1 |
| 4 | 01/13/93 | 93 | 100 | DATE | 1 |
| 5 | Oliveira | 110 | 117 | NAME | 1 |
| 6 | 25 | 121 | 122 | AGE | 0.8724 |
| 7 | 2079-11-09 | 150 | 159 | DATE | 1 |
| 8 | Cocke County Baptist Hospital | 163 | 191 | LOCATION | 0.8586 |
| 9 | 0295 Keats Street | 195 | 211 | LOCATION | 0.948667 |
| 10 | 302-786-5227 | 221 | 232 | CONTACT | 0.9972 |
```
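A common follow-up to this deidentification pipeline is masking the detected spans. The `begin`/`end` values in the output above are inclusive character offsets, so masking can be sketched in plain Python (`mask_phi` is an illustrative helper, not part of the library; the offsets below are taken from the first two rows of the example output):

```python
def mask_phi(text, chunks):
    """Replace each detected (begin, end, label) span with <LABEL>, right to left
    so earlier offsets stay valid."""
    for begin, end, label in sorted(chunks, key=lambda c: c[0], reverse=True):
        text = text[:begin] + f"<{label}>" + text[end + 1:]
    return text

note = "Record date : 2093-01-13 , David Hale , M.D ."
chunks = [(14, 23, "DATE"), (27, 36, "NAME")]
masked = mask_phi(note, chunks)
# -> "Record date : <DATE> , <NAME> , M.D ."
```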
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_generic_glove_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|167.3 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Word2Vec Embeddings in Upper Sorbian (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, hsb, open_source]
task: Embeddings
language: hsb
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_hsb_3.4.1_3.0_1647465128124.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_hsb_3.4.1_3.0_1647465128124.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","hsb") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","hsb")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("hsb.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
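A common downstream use of these token vectors is measuring word similarity with cosine similarity. A self-contained sketch with toy 4-dimensional vectors standing in for the 300-dimensional vectors this model returns in the `embeddings` column:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy vectors: related words point in similar directions
v_cat, v_dog, v_car = [1.0, 0.9, 0.1, 0.0], [0.9, 1.0, 0.2, 0.1], [0.0, 0.1, 1.0, 0.9]
assert cosine(v_cat, v_dog) > cosine(v_cat, v_car)
```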
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|hsb|
|Size:|144.4 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Persian Named Entity Recognition (from HooshvareLab)
author: John Snow Labs
name: bert_ner_bert_base_parsbert_ner_uncased
date: 2022-05-09
tags: [bert, ner, token_classification, fa, open_source]
task: Named Entity Recognition
language: fa
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-parsbert-ner-uncased` is a Persian model originally trained by `HooshvareLab`.
## Predicted Entities
`percent`, `facility`, `location`, `money`, `product`, `person`, `date`, `organization`, `time`, `event`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_parsbert_ner_uncased_fa_3.4.2_3.0_1652099655453.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_parsbert_ner_uncased_fa_3.4.2_3.0_1652099655453.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_parsbert_ner_uncased","fa") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["من عاشق جرقه nlp هستم"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_parsbert_ner_uncased","fa")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("من عاشق جرقه nlp هستم").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_bert_base_parsbert_ner_uncased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|fa|
|Size:|607.1 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/HooshvareLab/bert-base-parsbert-ner-uncased
- https://arxiv.org/abs/2005.12515
- http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/
- https://github.com/HaniehP/PersianNER
- https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb
- https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb
- https://tensorflow.org/tfrc
- https://hooshvare.com
- https://www.linkedin.com/in/m3hrdadfi/
- https://twitter.com/m3hrdadfi
- https://github.com/m3hrdadfi
- https://www.linkedin.com/in/mohammad-gharachorloo/
- https://twitter.com/MGharachorloo
- https://github.com/baarsaam
- https://www.linkedin.com/in/marziehphi/
- https://twitter.com/marziehphi
- https://github.com/marziehphi
- https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/
- https://twitter.com/mmanthouri
- https://github.com/mmanthouri
- https://hooshvare.com/
- https://www.linkedin.com/company/hooshvare
- https://twitter.com/hooshvare
- https://github.com/hooshvare
- https://www.instagram.com/hooshvare/
- https://www.linkedin.com/in/sara-tabrizi-64548b79/
- https://www.behance.net/saratabrizi
- https://www.instagram.com/sara_b_tabrizi/
---
layout: model
title: Legal Waiver Of Jury Trials Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_waiver_of_jury_trials_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, waiver_of_jury_trials, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Waiver_Of_Jury_Trials` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip them unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
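As a minimal sketch of the first technique (not the workshop's actual code), paragraph splitting by multiline amounts to splitting the raw text on runs of blank lines:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on runs of blank lines."""
    # Any run of two or more newlines (possibly with whitespace between)
    # is treated as a paragraph boundary.
    parts = re.split(r"\n\s*\n", text)
    # Drop empty fragments and trim surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

doc = "WAIVER OF JURY TRIAL.\nEach party waives...\n\nGOVERNING LAW.\nThis Agreement..."
print(split_paragraphs(doc))
```

Each resulting paragraph can then be fed to the classifier as a separate row.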
Note that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
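Conceptually, combining several binary clause classifiers yields one True/False flag per clause type; a toy sketch (with hypothetical classifier outputs, not this model's API) looks like:

```python
def combine_clause_predictions(predictions: dict) -> dict:
    """Turn each classifier's predicted label into a True/False flag.
    A prediction equal to 'Other' means the clause was not found."""
    return {clause: label != "Other" for clause, label in predictions.items()}

# Hypothetical outputs from two clause classifiers on the same document
preds = {
    "Waiver_Of_Jury_Trials": "Waiver_Of_Jury_Trials",
    "Governing_Law": "Other",
}
print(combine_clause_predictions(preds))
```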
## Predicted Entities
`Waiver_Of_Jury_Trials`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_waiver_of_jury_trials_bert_en_1.0.0_3.0_1678050634484.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_waiver_of_jury_trials_bert_en_1.0.0_3.0_1678050634484.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-----------------------+
|result                 |
+-----------------------+
|[Waiver_Of_Jury_Trials]|
|[Other]                |
|[Other]                |
|[Waiver_Of_Jury_Trials]|
+-----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_waiver_of_jury_trials_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 0.85 0.99 0.92 125
Waiver_Of_Jury_Trials 0.99 0.75 0.85 89
accuracy - - 0.89 214
macro-avg 0.92 0.87 0.88 214
weighted-avg 0.91 0.89 0.89 214
```
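As a sanity check on the table above, the macro and weighted averages can be recomputed from the (already rounded) per-class rows; small discrepancies with the printed table come from rounding:

```python
# Per-class F1 and support taken from the benchmarking table above
f1 = {"Other": 0.92, "Waiver_Of_Jury_Trials": 0.85}
support = {"Other": 125, "Waiver_Of_Jury_Trials": 89}

# Macro average: unweighted mean over classes
macro_f1 = sum(f1.values()) / len(f1)
# Weighted average: mean weighted by class support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(macro_f1, weighted_f1)
```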
---
layout: model
title: Mapping RxNorm Codes with Corresponding Actions and Treatments
author: John Snow Labs
name: rxnorm_action_treatment_mapper
date: 2022-05-08
tags: [en, chunk_mapper, rxnorm, action, treatment, licensed, clinical]
task: Chunk Mapping
language: en
nav_key: models
edition: Healthcare NLP 3.5.1
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained model maps RxNorm and RxNorm Extension codes with their corresponding action and treatment. Action refers to the function of the drug in various body systems; treatment refers to which disease the drug is used to treat.
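At its core, a chunk mapper is a lookup from a resolved code to one or more related fields. A toy sketch with made-up entries (not the model's actual dictionary or the Spark NLP API) illustrates the idea:

```python
# Toy mapping table; the real model ships a much larger curated dictionary
rxnorm_mappings = {
    "1000067": {
        "action": ["Antidepressant", "Anxiolytic"],
        "treatment": ["Depression", "Anxiety"],
    },
}

def map_code(code: str, relation: str) -> list:
    """Return the requested relation ('action' or 'treatment') for a code,
    or an empty list when the code is unknown."""
    return rxnorm_mappings.get(code, {}).get(relation, [])

print(map_code("1000067", "action"))
print(map_code("999", "treatment"))
```

In the real pipeline below, `setRel(...)` selects which relation the `ChunkMapperModel` returns.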
## Predicted Entities
`action`, `treatment`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_action_treatment_mapper_en_3.5.1_3.0_1652043181565.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_action_treatment_mapper_en_3.5.1_3.0_1652043181565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('ner_chunk')
sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
.setInputCols(["ner_chunk"])\
.setOutputCol("sentence_embeddings")\
.setCaseSensitive(False)
rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("rxnorm_code")\
.setDistanceFunction("EUCLIDEAN")
chunkerMapper_action = ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models")\
.setInputCols(["rxnorm_code"])\
.setOutputCol("Action")\
.setRel("Action")
chunkerMapper_treatment = ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models")\
.setInputCols(["rxnorm_code"])\
.setOutputCol("Treatment")\
.setRel("Treatment")
pipeline = Pipeline().setStages([document_assembler,
sbert_embedder,
rxnorm_resolver,
chunkerMapper_action,
chunkerMapper_treatment
])
model = pipeline.fit(spark.createDataFrame([['']]).toDF('text'))
light_pipeline = LightPipeline(model)
result = light_pipeline.annotate(['Sinequan 150 MG', 'Zonalon 50 mg'])
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en","clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("sentence_embeddings")
.setCaseSensitive(false)
val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models")
.setInputCols(Array("sentence_embeddings"))
.setOutputCol("rxnorm_code")
.setDistanceFunction("EUCLIDEAN")
val chunkerMapper_action = ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models")
.setInputCols("rxnorm_code")
.setOutputCol("Action")
.setRel("Action")
val chunkerMapper_treatment = ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models")
.setInputCols("rxnorm_code")
.setOutputCol("Treatment")
.setRel("Treatment")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sbert_embedder,
rxnorm_resolver,
chunkerMapper_action,
chunkerMapper_treatment
))
val text_data = Seq("Sinequan 150 MG", "Zonalon 50 mg").toDS.toDF("text")
val res = pipeline.fit(text_data).transform(text_data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.rxnorm_to_action_treatment").predict("""Sinequan 150 MG""")
```
## Results
```bash
| | ner_chunk | rxnorm_code | Treatment | Action |
|---:|:--------------------|:--------------|:-------------------------------------------------------------------------------|:-----------------------------------------------------------------------|
| 0 | ['Sinequan 150 MG'] | ['1000067'] | ['Alcoholism', 'Depression', 'Neurosis', 'Anxiety&Panic Attacks', 'Psychosis'] | ['Antidepressant', 'Anxiolytic', 'Psychoanaleptics', 'Sedative'] |
| 1 | ['Zonalon 50 mg'] | ['103971'] | ['Pain'] | ['Analgesic', 'Analgesic (Opioid)', 'Analgetic', 'Opioid', 'Vitamins'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|rxnorm_action_treatment_mapper|
|Compatibility:|Healthcare NLP 3.5.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|19.3 MB|
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_recipe_triplet_recipes_base_timestep_squadv2_epochs_3
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `recipe_triplet_recipes-roberta-base_TIMESTEP_squadv2_epochs_3` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_recipe_triplet_recipes_base_timestep_squadv2_epochs_3_en_4.3.0_3.0_1674212220798.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_recipe_triplet_recipes_base_timestep_squadv2_epochs_3_en_4.3.0_3.0_1674212220798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_recipe_triplet_recipes_base_timestep_squadv2_epochs_3","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_recipe_triplet_recipes_base_timestep_squadv2_epochs_3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_recipe_triplet_recipes_base_timestep_squadv2_epochs_3|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|467.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AnonymousSub/recipe_triplet_recipes-roberta-base_TIMESTEP_squadv2_epochs_3
---
layout: model
title: Legal Rules of construction Clause Binary Classifier
author: John Snow Labs
name: legclf_rules_of_construction_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `rules-of-construction` clause type. To use this model, make sure you provide enough context as an input. Adding a Sentence Splitter to the pipeline will make the model see only sentences rather than the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Note that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `rules-of-construction`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_rules_of_construction_clause_en_1.0.0_3.2_1660122976182.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_rules_of_construction_clause_en_1.0.0_3.2_1660122976182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------+
| result|
+-------+
|[rules-of-construction]|
|[other]|
|[other]|
|[rules-of-construction]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_rules_of_construction_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.96 0.99 0.97 98
rules-of-construction 0.98 0.91 0.94 46
accuracy - - 0.97 144
macro-avg 0.97 0.95 0.96 144
weighted-avg 0.97 0.97 0.96 144
```
---
layout: model
title: Translate Dravidian languages to English Pipeline
author: John Snow Labs
name: translate_dra_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, dra, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
Note that this module is very computationally expensive, especially on longer sequences, so the use of an accelerator such as a GPU is recommended.
- source languages: `dra`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_dra_en_xx_2.7.0_2.4_1609686505330.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_dra_en_xx_2.7.0_2.4_1609686505330.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_dra_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_dra_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.dra.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_dra_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Compliance Clause Binary Classifier
author: John Snow Labs
name: legclf_compliance_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `compliance` clause type. To use this model, make sure you provide enough context as an input. Adding a Sentence Splitter to the pipeline will make the model see only sentences rather than the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Note that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `compliance`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_compliance_clause_en_1.0.0_3.2_1660122240327.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_compliance_clause_en_1.0.0_3.2_1660122240327.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------+
| result|
+-------+
|[compliance]|
|[other]|
|[other]|
|[compliance]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_compliance_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
compliance 0.80 0.82 0.81 67
other 0.94 0.93 0.93 188
accuracy - - 0.90 255
macro-avg 0.87 0.87 0.87 255
weighted-avg 0.90 0.90 0.90 255
```
---
layout: model
title: Sentence Embeddings - sbiobert (tuned)
author: John Snow Labs
name: sbiobert_jsl_umls_cased
date: 2021-06-30
tags: [embeddings, clinical, licensed, en]
task: Embeddings
language: en
nav_key: models
edition: Healthcare NLP 3.1.0
spark_version: 2.4
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained to generate contextual sentence embeddings of input sentences.
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobert_jsl_umls_cased_en_3.1.0_2.4_1625050246280.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobert_jsl_umls_cased_en_3.1.0_2.4_1625050246280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_jsl_umls_cased","en","clinical/models").setInputCols(["sentence"]).setOutputCol("sbert_embeddings")
```
```scala
val sbiobert_embeddings = BertSentenceEmbeddings
.pretrained("sbiobert_jsl_umls_cased","en","clinical/models")
.setInputCols(Array("sentence"))
.setOutputCol("sbert_embeddings")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed_sentence.biobert.jsl_umls_cased").predict("""Put your text here.""")
```
## Results
```bash
Gives a 768-dimensional vector representation of the sentence.
```
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobert_jsl_umls_cased|
|Compatibility:|Healthcare NLP 3.1.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Case sensitive:|true|
## Data Source
Tuned on MedNLI dataset
## Benchmarking
```bash
MedNLI Score
Acc 0.758
STS(cos) 0.651
```
---
layout: model
title: BERT Sentence Embeddings trained on MEDLINE/PubMed and fine-tuned on SQuAD 2.0
author: John Snow Labs
name: sent_bert_pubmed_squad2
date: 2021-08-31
tags: [en, open_source, sentence_embeddings, medline_pubmed_dataset, squad_2_dataset]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.2.0
spark_version: 3.0
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/pubmed/1 and fine-tuned on SQuAD 2.0. This is a BERT base architecture but some changes have been made to the original training and export scheme based on more recent learnings.
This model is intended to be used for a variety of English NLP tasks in the medical domain. It was fine-tuned on SQuAD 2.0 as a span-labeling task, labeling the answer to a question in a given context, and is recommended for use in question answering tasks.
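Span labeling in extractive QA reduces to predicting the start and end offsets of the answer inside the context, with "no span" for unanswerable questions (the SQuAD 2.0 case). A minimal illustration of the label format, using simple substring search rather than the model itself:

```python
def label_span(context: str, answer: str):
    """Return (start, end) character offsets of the answer in the context,
    or None when the question is unanswerable (the SQuAD 2.0 case)."""
    start = context.find(answer)
    if start == -1:
        return None
    return (start, start + len(answer))

context = "My name is Clara and I live in Berkeley."
span = label_span(context, "Clara")
print(span, context[span[0]:span[1]])
```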
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_pubmed_squad2_en_3.2.0_3.0_1630412086842.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_pubmed_squad2_en_3.2.0_3.0_1630412086842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_pubmed_squad2", "en") \
.setInputCols("sentence") \
.setOutputCol("bert_sentence")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings ])
```
```scala
val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_pubmed_squad2", "en")
.setInputCols("sentence")
.setOutputCol("bert_sentence")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings ))
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
sent_embeddings_df = nlu.load('en.embed_sentence.bert.pubmed_squad2').predict(text, output_level='sentence')
sent_embeddings_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_bert_pubmed_squad2|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[bert_sentence]|
|Language:|en|
|Case sensitive:|false|
## Data Source
[1]: [MEDLINE/PubMed dataset](https://www.nlm.nih.gov/databases/download/pubmed_medline.html)
[2]: [Stanford Queston Answering (SQuAD 2.0) dataset](https://rajpurkar.github.io/SQuAD-explorer/)
This Model has been imported from: https://tfhub.dev/google/experts/bert/pubmed/squad2/2
---
layout: model
title: Detect Persons, Locations, Organizations and Misc Entities in Russian (WikiNER 840B 300)
author: John Snow Labs
name: wikiner_840B_300
date: 2020-03-16
task: Named Entity Recognition
language: ru
edition: Spark NLP 2.4.4
spark_version: 2.4
tags: [ner, ru, open_source]
supported: true
annotator: NerDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 840B 300 is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline.
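"Closer together" is typically measured with cosine similarity between embedding vectors; a small self-contained sketch with made-up 3-dimensional vectors (real GloVe 840B vectors have 300 dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up toy embeddings for illustration only
king, queen, banana = [0.9, 0.8, 0.1], [0.85, 0.82, 0.12], [0.1, 0.2, 0.95]
print(cosine(king, queen) > cosine(king, banana))  # semantically closer pair scores higher
```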
{:.h2_title}
## Predicted Entities
Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_RU){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_RU.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_ru_2.4.4_2.4_1584014001695.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_ru_2.4.4_2.4_1584014001695.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
ner_model = NerDLModel.pretrained("wikiner_840B_300", "ru") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['Уильям Генри Гейтс III (родился 28 октября 1955 года) - американский бизнес-магнат, разработчик программного обеспечения, инвестор и филантроп. Он наиболее известен как соучредитель корпорации Microsoft. За время своей карьеры в Microsoft Гейтс занимал должности председателя, главного исполнительного директора (CEO), президента и главного архитектора программного обеспечения, а также был крупнейшим индивидуальным акционером до мая 2014 года. Он является одним из самых известных предпринимателей и пионеров микрокомпьютерная революция 1970-х и 1980-х годов. Гейтс родился и вырос в Сиэтле, штат Вашингтон, в 1975 году вместе с другом детства Полом Алленом в Альбукерке, штат Нью-Мексико, и основал компанию Microsoft. она стала крупнейшей в мире компанией-разработчиком программного обеспечения для персональных компьютеров. Гейтс руководил компанией в качестве председателя и генерального директора, пока в январе 2000 года не ушел с поста генерального директора, но остался председателем и стал главным архитектором программного обеспечения. В конце 1990-х Гейтс подвергся критике за свою деловую тактику, которая считалась антиконкурентной. Это мнение было подтверждено многочисленными судебными решениями. В июне 2006 года Гейтс объявил, что перейдет на неполный рабочий день в Microsoft и будет работать на полную ставку в Фонде Билла и Мелинды Гейтс, частном благотворительном фонде, который он и его жена Мелинда Гейтс создали в 2000 году. [ 9] Постепенно он передал свои обязанности Рэю Оззи и Крейгу Манди. Он ушел с поста президента Microsoft в феврале 2014 года и занял новую должность консультанта по технологиям для поддержки вновь назначенного генерального директора Сатья Наделла.']], ["text"]))
```
```scala
...
val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("wikiner_840B_300", "ru")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("Уильям Генри Гейтс III (родился 28 октября 1955 года) - американский бизнес-магнат, разработчик программного обеспечения, инвестор и филантроп. Он наиболее известен как соучредитель корпорации Microsoft. За время своей карьеры в Microsoft Гейтс занимал должности председателя, главного исполнительного директора (CEO), президента и главного архитектора программного обеспечения, а также был крупнейшим индивидуальным акционером до мая 2014 года. Он является одним из самых известных предпринимателей и пионеров микрокомпьютерная революция 1970-х и 1980-х годов. Гейтс родился и вырос в Сиэтле, штат Вашингтон, в 1975 году вместе с другом детства Полом Алленом в Альбукерке, штат Нью-Мексико, и основал компанию Microsoft. она стала крупнейшей в мире компанией-разработчиком программного обеспечения для персональных компьютеров. Гейтс руководил компанией в качестве председателя и генерального директора, пока в январе 2000 года не ушел с поста генерального директора, но остался председателем и стал главным архитектором программного обеспечения. В конце 1990-х Гейтс подвергся критике за свою деловую тактику, которая считалась антиконкурентной. Это мнение было подтверждено многочисленными судебными решениями. В июне 2006 года Гейтс объявил, что перейдет на неполный рабочий день в Microsoft и будет работать на полную ставку в Фонде Билла и Мелинды Гейтс, частном благотворительном фонде, который он и его жена Мелинда Гейтс создали в 2000 году. [ 9] Постепенно он передал свои обязанности Рэю Оззи и Крейгу Манди. Он ушел с поста президента Microsoft в феврале 2014 года и занял новую должность консультанта по технологиям для поддержки вновь назначенного генерального директора Сатья Наделла.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Уильям Генри Гейтс III (родился 28 октября 1955 года) - американский бизнес-магнат, разработчик программного обеспечения, инвестор и филантроп. Он наиболее известен как соучредитель корпорации Microsoft. За время своей карьеры в Microsoft Гейтс занимал должности председателя, главного исполнительного директора (CEO), президента и главного архитектора программного обеспечения, а также был крупнейшим индивидуальным акционером до мая 2014 года. Он является одним из самых известных предпринимателей и пионеров микрокомпьютерная революция 1970-х и 1980-х годов. Гейтс родился и вырос в Сиэтле, штат Вашингтон, в 1975 году вместе с другом детства Полом Алленом в Альбукерке, штат Нью-Мексико, и основал компанию Microsoft. она стала крупнейшей в мире компанией-разработчиком программного обеспечения для персональных компьютеров. Гейтс руководил компанией в качестве председателя и генерального директора, пока в январе 2000 года не ушел с поста генерального директора, но остался председателем и стал главным архитектором программного обеспечения. В конце 1990-х Гейтс подвергся критике за свою деловую тактику, которая считалась антиконкурентной. Это мнение было подтверждено многочисленными судебными решениями. В июне 2006 года Гейтс объявил, что перейдет на неполный рабочий день в Microsoft и будет работать на полную ставку в Фонде Билла и Мелинды Гейтс, частном благотворительном фонде, который он и его жена Мелинда Гейтс создали в 2000 году. [ 9] Постепенно он передал свои обязанности Рэю Оззи и Крейгу Манди. Он ушел с поста президента Microsoft в феврале 2014 года и занял новую должность консультанта по технологиям для поддержки вновь назначенного генерального директора Сатья Наделла."""]
ner_df = nlu.load('ru.ner.wikiner.glove.840B_300').predict(text, output_level = "chunk")
ner_df[["entities", "entities_confidence"]]
```
{:.h2_title}
## Results
```bash
+----------------------+---------+
|chunk |ner_label|
+----------------------+---------+
|Уильям Генри Гейтс III|PER |
|Microsoft |ORG |
|Microsoft Гейтс |ORG |
|CEO |ORG |
|Гейтс |PER |
|Сиэтле |LOC |
|Вашингтон |LOC |
|Полом Алленом |PER |
|Альбукерке |LOC |
|Нью-Мексико |LOC |
|Microsoft |ORG |
|Гейтс |PER |
|Гейтс |PER |
|Гейтс |PER |
|Microsoft |ORG |
|Фонде Билла |PER |
|Мелинды Гейтс |PER |
|Мелинда Гейтс |PER |
|Постепенно |PER |
|Рэю Оззи |PER |
+----------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|wikiner_840B_300|
|Type:|ner|
|Compatibility:| Spark NLP 2.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ru|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is trained based on data from [https://ru.wikipedia.org](https://ru.wikipedia.org)
---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC4CHEMD_Chem_Original_BlueBERT_384
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Original-BlueBERT-384` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities
`Chemical`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_BlueBERT_384_en_4.0.0_3.0_1657108595543.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_BlueBERT_384_en_4.0.0_3.0_1657108595543.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_BlueBERT_384","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_BlueBERT_384","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_BC4CHEMD_Chem_Original_BlueBERT_384|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|408.7 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Original-BlueBERT-384
---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC4_original_PubmedBert
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4-original-PubmedBert` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities
`Chemical`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_original_PubmedBert_en_4.0.0_3.0_1657108214734.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_original_PubmedBert_en_4.0.0_3.0_1657108214734.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_original_PubmedBert","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_original_PubmedBert","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_BC4_original_PubmedBert|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|408.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ghadeermobasher/BC4-original-PubmedBert
---
layout: model
title: Word Embeddings for Arabic (arabic_w2v_cc_300d)
author: John Snow Labs
name: arabic_w2v_cc_300d
date: 2020-12-05
task: Embeddings
language: ar
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [embeddings, ar, open_source]
supported: true
annotator: WordEmbeddingsModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained on Common Crawl and Wikipedia using fastText. It is trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives.
The model gives 300 dimensional vector outputs per token. The output vectors map words into a meaningful space where the distance between the vectors is related to semantic similarity of words.
These embeddings can be used in multiple tasks like semantic word similarity, named entity recognition, sentiment analysis, and classification.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/arabic_w2v_cc_300d_ar_2.7.0_2.4_1607168354606.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/arabic_w2v_cc_300d_ar_2.7.0_2.4_1607168354606.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of a pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['أنا أحب التعلم الآلي']], ["text"]))
```
```scala
...
val embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("أنا أحب التعلم الآلي").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["أنا أحب التعلم الآلي"]
arabicvec_df = nlu.load('ar.embed.cbow.300d').predict(text, output_level='token')
arabicvec_df
```
{:.h2_title}
## Results
The model gives 300 dimensional Word2Vec feature vector outputs per token.
```bash
| ar_embed_cbow_300d_embeddings                     | token  |
|---------------------------------------------------|--------|
| [-0.11158058792352676, -0.06634224951267242, -... | أنا    |
| [-0.2818698585033417, -0.21061033010482788, -0... | أحب    |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|arabic_w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.7.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[word_embeddings]|
|Language:|ar|
|Case sensitive:|false|
|Dimension:|300|
## Data Source
This model is imported from [https://fasttext.cc/docs/en/crawl-vectors.html](https://fasttext.cc/docs/en/crawl-vectors.html)
---
layout: model
title: English DistilBertForQuestionAnswering model (from jgammack) SAE
author: John Snow Labs
name: distilbert_qa_SAE_base_uncased_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `SAE-distilbert-base-uncased-squad` is an English model originally trained by `jgammack`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_SAE_base_uncased_squad_en_4.0.0_3.0_1654722995068.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_SAE_base_uncased_squad_en_4.0.0_3.0_1654722995068.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_SAE_base_uncased_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_SAE_base_uncased_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased_sae.by_jgammack").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_SAE_base_uncased_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/jgammack/SAE-distilbert-base-uncased-squad
---
layout: model
title: Legal Definitions Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_definitions_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, definitions, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Definitions` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
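As a minimal, framework-free illustration of the first technique (a plain-Python sketch for demonstration only; the tutorial linked above shows the full Spark NLP approach), paragraph splitting by multiline amounts to splitting on runs of blank lines:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on runs of blank lines ("multiline")."""
    # One or more empty (or whitespace-only) lines mark a paragraph boundary.
    parts = re.split(r"\n\s*\n", text)
    # Drop empty fragments and strip surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

doc = ("1. Definitions.\nCapitalized terms have the meanings set forth below.\n"
       "\n"
       "2. Term.\nThis Agreement begins on the Effective Date.")
print(split_paragraphs(doc))  # two provisions, ready to classify one by one
```

Each resulting paragraph can then be fed to the classifier as an independent row of the input DataFrame.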
Take into consideration that the embeddings of this model allow up to 512 tokens. If your input is longer, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Definitions`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_definitions_bert_en_1.0.0_3.0_1678049915628.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_definitions_bert_en_1.0.0_3.0_1678049915628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
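This card shipped without a usage snippet; the following is a minimal sketch assembled from sibling `legclf_*` model cards. It assumes Legal NLP is installed and licensed, and that `sent_bert_base_cased` is the companion sentence-embeddings model (an assumption — check the card's notebook for the exact embeddings to pair with this classifier).

```python
# Sketch only: standard Legal NLP clause-classifier pipeline.
# "sent_bert_base_cased" is an assumed companion embeddings model.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_definitions_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```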
## Results
```bash
+-------------+
|result       |
+-------------+
|[Definitions]|
|[Other]      |
|[Other]      |
|[Definitions]|
+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_definitions_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Definitions 0.97 0.96 0.96 67
Other 0.97 0.98 0.97 95
accuracy - - 0.97 162
macro-avg 0.97 0.97 0.97 162
weighted-avg 0.97 0.97 0.97 162
```
---
layout: model
title: Relation Extraction between anatomical entities and other clinical entities (ReDL)
author: John Snow Labs
name: redl_oncology_location_biobert_wip
date: 2023-01-15
tags: [licensed, clinical, oncology, en, relation_extraction, anatomy, tensorflow]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This relation extraction model links extractions from anatomical entities (such as Site_Breast or Site_Lung) to other clinical entities (such as Tumor_Finding or Cancer_Surgery).
## Predicted Entities
`is_location_of`, `O`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_location_biobert_wip_en_4.2.4_3.0_1673770597615.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_location_biobert_wip_en_4.2.4_3.0_1673770597615.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use relation pairs to include only the combinations of entities that are relevant in your case.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos_tags")
dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \
.setInputCols(["sentence", "pos_tags", "token"]) \
.setOutputCol("dependencies")
re_ner_chunk_filter = RENerChunksFilter()\
.setInputCols(["ner_chunk", "dependencies"])\
.setOutputCol("re_ner_chunk")\
.setMaxSyntacticDistance(10)\
.setRelationPairs(["Tumor_Finding-Site_Breast", "Site_Breast-Tumor_Finding", "Tumor_Finding-Anatomical_Site", "Anatomical_Site-Tumor_Finding"])
re_model = RelationExtractionDLModel.pretrained("redl_oncology_location_biobert_wip", "en", "clinical/models")\
.setInputCols(["re_ner_chunk", "sentence"])\
.setOutputCol("relation_extraction")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
pos_tagger,
dependency_parser,
re_ner_chunk_filter,
re_model])
data = spark.createDataFrame([["In April 2011, she first noticed a lump in her right breast."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos_tags")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentence", "pos_tags", "token"))
.setOutputCol("dependencies")
val re_ner_chunk_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunk", "dependencies"))
.setOutputCol("re_ner_chunk")
.setMaxSyntacticDistance(10)
.setRelationPairs(Array("Tumor_Finding-Site_Breast", "Site_Breast-Tumor_Finding","Tumor_Finding-Anatomical_Site", "Anatomical_Site-Tumor_Finding"))
val re_model = RelationExtractionDLModel.pretrained("redl_oncology_location_biobert_wip", "en", "clinical/models")
.setPredictionThreshold(0.5f)
.setInputCols(Array("re_ner_chunk", "sentence"))
.setOutputCol("relation_extraction")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
pos_tagger,
dependency_parser,
re_ner_chunk_filter,
re_model))
val data = Seq("""In April 2011, she first noticed a lump in her right breast.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.oncology_location_biobert_wip").predict("""In April 2011, she first noticed a lump in her right breast.""")
```
## Results
```bash
+--------------+-------------+-------------+-----------+------+-----------+-------------+-----------+------+----------+
| relation| entity1|entity1_begin|entity1_end|chunk1| entity2|entity2_begin|entity2_end|chunk2|confidence|
+--------------+-------------+-------------+-----------+------+-----------+-------------+-----------+------+----------+
|is_location_of|Tumor_Finding| 35| 38| lump|Site_Breast| 53| 58|breast| 0.9628376|
+--------------+-------------+-------------+-----------+------+-----------+-------------+-----------+------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|redl_oncology_location_biobert_wip|
|Compatibility:|Healthcare NLP 4.2.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|401.7 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label recall precision f1
O 0.90 0.94 0.92
is_location_of 0.94 0.90 0.92
macro-avg 0.92 0.92 0.92
```
---
layout: model
title: Chinese BertForMaskedLM Cased model (from hfl)
author: John Snow Labs
name: bert_embeddings_rbt6
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rbt6` is a Chinese model originally trained by `hfl`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt6_zh_4.2.4_3.0_1670327118775.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt6_zh_4.2.4_3.0_1670327118775.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt6","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt6","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_rbt6|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|224.3 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/hfl/rbt6
- https://arxiv.org/abs/1906.08101
- https://github.com/google-research/bert
- https://github.com/ymcui/Chinese-BERT-wwm
- https://github.com/ymcui/MacBERT
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/ymcui/HFL-Anthology
- https://arxiv.org/abs/2004.13922
- https://arxiv.org/abs/1906.08101
---
layout: model
title: Legal Cause Clause Binary Classifier
author: John Snow Labs
name: legclf_cause_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `cause` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your input is longer, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `cause`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cause_clause_en_1.0.0_3.2_1660122210522.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cause_clause_en_1.0.0_3.2_1660122210522.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
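This card shipped without a usage snippet; the following is a minimal sketch assembled from sibling `legclf_*` model cards. It assumes Legal NLP is installed and licensed, and that `sent_bert_base_cased` is the companion sentence-embeddings model (an assumption — check the card's notebook for the exact embeddings to pair with this classifier).

```python
# Sketch only: standard Legal NLP clause-classifier pipeline.
# "sent_bert_base_cased" is an assumed companion embeddings model.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_cause_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```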
## Results
```bash
+-------+
| result|
+-------+
|[cause]|
|[other]|
|[other]|
|[cause]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_cause_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|
## References
Legal documents, scraped from the Internet and classified in-house.
## Benchmarking
```bash
label precision recall f1-score support
cause 0.96 1.00 0.98 27
other 1.00 0.99 1.00 108
accuracy - - 0.99 135
macro-avg 0.98 1.00 0.99 135
weighted-avg 0.99 0.99 0.99 135
```
---
layout: model
title: Chinese BertForQuestionAnswering model (from jackh1995)
author: John Snow Labs
name: bert_qa_bert_chinese_finetuned
date: 2022-06-02
tags: [zh, open_source, question_answering, bert]
task: Question Answering
language: zh
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-chinese-finetuned` is a Chinese model originally trained by `jackh1995`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_chinese_finetuned_zh_4.0.0_3.0_1654181635362.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_chinese_finetuned_zh_4.0.0_3.0_1654181635362.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_chinese_finetuned","zh") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_chinese_finetuned","zh")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.answer_question.bert.by_jackh1995").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_chinese_finetuned|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|zh|
|Size:|381.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/jackh1995/bert-chinese-finetuned
---
layout: model
title: Stopwords Remover for Ancient Greek language (907 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, grc, open_source]
task: Stop Words Removal
language: grc
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_grc_3.4.1_3.0_1646673167002.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_grc_3.4.1_3.0_1646673167002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","grc") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Όντας δε θνητούς θνητά και φρονείν χρεών."]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stopWords = StopWordsCleaner.pretrained("stopwords_iso","grc")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stopWords))
val data = Seq("Όντας δε θνητούς θνητά και φρονείν χρεών.").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("grc.stopwords").predict("""Όντας δε θνητούς θνητά και φρονείν χρεών.""")
```
## Results
```bash
+---------------------------------------------------+
|result |
+---------------------------------------------------+
|[Όντας, δε, θνητούς, θνητά, και, φρονείν, χρεών, .]|
+---------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|grc|
|Size:|4.5 KB|
---
layout: model
title: English BertForQuestionAnswering model (from rsvp-ai)
author: John Snow Labs
name: bert_qa_bertserini_bert_base_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bertserini-bert-base-squad` is an English model originally trained by `rsvp-ai`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bertserini_bert_base_squad_en_4.0.0_3.0_1654185449571.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bertserini_bert_base_squad_en_4.0.0_3.0_1654185449571.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bertserini_bert_base_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bertserini_bert_base_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base.by_rsvp-ai").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bertserini_bert_base_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/rsvp-ai/bertserini-bert-base-squad
---
layout: model
title: Google's Tapas Table Understanding (Mini, WTQ)
author: John Snow Labs
name: table_qa_tapas_mini_finetuned_wtq
date: 2022-09-30
tags: [en, table, qa, question, answering, open_source]
task: Table Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: TapasForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a zero-shot table-understanding model that lets you carry out question answering over Spark DataFrames. If your data is stored in a table format such as CSV, load it into Spark before use.
Size of this model: Mini
Supports aggregation operations: yes
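For CSV data, one way to produce the JSON table structure consumed by `TableAssembler` in the example below is a small conversion helper (`csv_to_table_json` is a hypothetical utility, not a Spark NLP API):

```python
import csv
import json

def csv_to_table_json(path):
    """Read a CSV file and emit the {"header": [...], "rows": [[...], ...]}
    JSON structure used in the TableAssembler example. Hypothetical helper,
    not part of Spark NLP."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    return json.dumps({"header": rows[0], "rows": rows[1:]})
```

The resulting string can be placed in the `table_json` column exactly like the hand-written `json_data` below.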
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_mini_finetuned_wtq_en_4.2.0_3.0_1664530449660.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_mini_finetuned_wtq_en_4.2.0_3.0_1664530449660.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
json_data = """
{
"header": ["name", "money", "age"],
"rows": [
["Donald Trump", "$100,000,000", "75"],
["Elon Musk", "$20,000,000,000,000", "55"]
]
}
"""
queries = [
"Who earns less than 200,000,000?",
"Who earns 100,000,000?",
"How much money has Donald Trump?",
"How old are they?",
]
data = spark.createDataFrame([
[json_data, " ".join(queries)]
]).toDF("table_json", "questions")
document_assembler = MultiDocumentAssembler() \
.setInputCols("table_json", "questions") \
.setOutputCols("document_table", "document_questions")
sentence_detector = SentenceDetector() \
.setInputCols(["document_questions"]) \
.setOutputCol("questions")
table_assembler = TableAssembler()\
.setInputCols(["document_table"])\
.setOutputCol("table")
tapas = TapasForQuestionAnswering\
.pretrained("table_qa_tapas_mini_finetuned_wtq","en")\
.setInputCols(["questions", "table"])\
.setOutputCol("answers")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
table_assembler,
tapas
])
model = pipeline.fit(data)
model\
.transform(data)\
.selectExpr("explode(answers) AS answer")\
.select("answer")\
.show(truncate=False)
```
## Results
```bash
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|answer |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} |
|{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} |
|{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} |
|{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|table_qa_tapas_mini_finetuned_wtq|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|43.4 MB|
|Case sensitive:|false|
## References
https://www.microsoft.com/en-us/download/details.aspx?id=54253
https://github.com/ppasupat/WikiTableQuestions
---
layout: model
title: Fast Neural Machine Translation Model from English to Efik
author: John Snow Labs
name: opus_mt_en_efi
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, efi, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `efi`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_efi_xx_2.7.0_2.4_1609164796629.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_efi_xx_2.7.0_2.4_1609164796629.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_en_efi", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate goes here.")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_efi", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate goes here.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.efi').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_efi|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab11_by_sameearif88 TFWav2Vec2ForCTC from sameearif88
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab11_by_sameearif88
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab11_by_sameearif88` is an English model originally trained by sameearif88.
NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_wav2vec2_base_timit_demo_colab11_by_sameearif88_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab11_by_sameearif88_en_4.2.0_3.0_1664021285479.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab11_by_sameearif88_en_4.2.0_3.0_1664021285479.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_timit_demo_colab11_by_sameearif88", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_timit_demo_colab11_by_sameearif88", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab11_by_sameearif88|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|
---
layout: model
title: English AlbertForQuestionAnswering model (from madlag)
author: John Snow Labs
name: albert_qa_base_v2_squad
date: 2022-06-24
tags: [en, open_source, albert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: AlBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert-base-v2-squad` is an English model originally trained by `madlag`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_base_v2_squad_en_4.0.0_3.0_1656063705520.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_base_v2_squad_en_4.0.0_3.0_1656063705520.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_base_v2_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_base_v2_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.albert.base_v2").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_qa_base_v2_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|42.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/madlag/albert-base-v2-squad
- https://github.com/google-research/albert
---
layout: model
title: Extract relations between problem, test, and findings in reports
author: John Snow Labs
name: re_test_problem_finding
date: 2021-04-19
tags: [en, relation_extraction, licensed, clinical]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 2.7.1
spark_version: 2.4
supported: true
annotator: RelationExtractionModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Find relations between diagnoses, tests, and imaging findings in radiology reports. `1`: the two entities are related. `0`: the two entities are not related.
## Predicted Entities
`0`, `1`
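Downstream, these binary labels can be filtered to keep only confident positive relations. A sketch over a hypothetical list of relation dicts (the field names are assumptions that mirror the shape of Spark NLP relation annotation metadata, not an exact API):

```python
def keep_related(relations, threshold=0.9):
    """Keep relations predicted as related ('1') whose confidence meets the
    threshold. `relations` is a hypothetical list of plain dicts."""
    return [
        r for r in relations
        if r["result"] == "1" and float(r["confidence"]) >= threshold
    ]

preds = [
    {"result": "1", "confidence": "0.97", "chunk1": "biopsy", "chunk2": "lesion"},
    {"result": "0", "confidence": "0.99", "chunk1": "biopsy", "chunk2": "pain"},
    {"result": "1", "confidence": "0.55", "chunk1": "CT", "chunk2": "mass"},
]
print(keep_related(preds))  # only the biopsy-lesion pair survives
```

The same cutoff is applied inside the pipeline itself via `setPredictionThreshold(0.9)` in the example below.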
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_RADIOLOGY/){:.button.button-orange}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_test_problem_finding_en_2.7.1_2.4_1618830922197.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_test_problem_finding_en_2.7.1_2.4_1618830922197.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
In the table below, `re_test_problem_finding` RE model, its labels, optimal NER model, and meaningful relation pairs are illustrated.
| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:---:|:---:|:---:|---|
| re_test_problem_finding | 0,1 | ner_jsl | [“test-cerebrovascular_disease”, “cerebrovascular_disease-test”, “test-communicable_disease”, “communicable_disease-test”, “test-diabetes”, “diabetes-test”, “test-disease_syndrome_disorder”, “disease_syndrome_disorder-test”, “test-heart_disease”, “heart_disease-test”, “test-hyperlipidemia”, “hyperlipidemia-test”, “test-hypertension”, “hypertension-test”, “test-injury_or_poisoning”, “injury_or_poisoning-test”, “test-kidney_disease”, “kidney_disease-test”, “test-obesity”, “obesity-test”, “test-oncological”, “oncological-test”, “test-psychological_condition”, “psychological_condition-test”, “test-symptom”, “symptom-test”, “ekg_findings-disease_syndrome_disorder”, “disease_syndrome_disorder-ekg_findings”, “ekg_findings-heart_disease”, “heart_disease-ekg_findings”, “ekg_findings-symptom”, “symptom-ekg_findings”, “imagingfindings-cerebrovascular_disease”, “cerebrovascular_disease-imagingfindings”, “imagingfindings-communicable_disease”, “communicable_disease-imagingfindings”, “imagingfindings-disease_syndrome_disorder”, “disease_syndrome_disorder-imagingfindings”, “imagingfindings-heart_disease”, “heart_disease-imagingfindings”, “imagingfindings-hyperlipidemia”, “hyperlipidemia-imagingfindings”, “imagingfindings-hypertension”, “hypertension-imagingfindings”, “imagingfindings-injury_or_poisoning”, “injury_or_poisoning-imagingfindings”, “imagingfindings-kidney_disease”, “kidney_disease-imagingfindings”, “imagingfindings-oncological”, “oncological-imagingfindings”, “imagingfindings-psychological_condition”, “psychological_condition-imagingfindings”, “imagingfindings-symptom”, “symptom-imagingfindings”, “vs_finding-cerebrovascular_disease”, “cerebrovascular_disease-vs_finding”, “vs_finding-communicable_disease”, “communicable_disease-vs_finding”, “vs_finding-diabetes”, “diabetes-vs_finding”, “vs_finding-disease_syndrome_disorder”, “disease_syndrome_disorder-vs_finding”, “vs_finding-heart_disease”, “heart_disease-vs_finding”, “vs_finding-hyperlipidemia”, “hyperlipidemia-vs_finding”, “vs_finding-hypertension”, “hypertension-vs_finding”, “vs_finding-injury_or_poisoning”, “injury_or_poisoning-vs_finding”, “vs_finding-kidney_disease”, “kidney_disease-vs_finding”, “vs_finding-obesity”, “obesity-vs_finding”, “vs_finding-oncological”, “oncological-vs_finding”, “vs_finding-overweight”, “overweight-vs_finding”, “vs_finding-psychological_condition”, “psychological_condition-vs_finding”, “vs_finding-symptom”, “symptom-vs_finding”] |
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")
pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
ner_tagger = MedicalNerModel.pretrained("jsl_ner_wip_clinical", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_chunker = NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ner_tags"])\
.setOutputCol("ner_chunks")
dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")\
.setInputCols(["sentences", "pos_tags", "tokens"])\
.setOutputCol("dependencies")
re_model = RelationExtractionModel()\
.pretrained("re_test_problem_finding", "en", 'clinical/models')\
.setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
.setOutputCol("relations")\
.setMaxSyntacticDistance(4)\
.setPredictionThreshold(0.9)\
.setRelationPairs(["procedure-symptom"])
nlp_pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate("""Targeted biopsy of this lesion for histological correlation should be considered.""")
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val ner_tagger = MedicalNerModel.pretrained("jsl_ner_wip_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_chunker = new NerConverterInternal()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
val re_model = RelationExtractionModel()
.pretrained("re_test_problem_finding", "en", "clinical/models")
.setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies"))
.setOutputCol("relations")
.setMaxSyntacticDistance(4)
.setPredictionThreshold(0.9)
.setRelationPairs(Array("procedure-symptom"))
val nlp_pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model))
val data = Seq("""Targeted biopsy of this lesion for histological correlation should be considered.""").toDS.toDF("text")
val result = nlp_pipeline.fit(data).transform(data)
```
## Results
```bash
| index | relations | entity1 | chunk1 | entity2 | chunk2 |
|-------|--------------|--------------|---------------------|--------------|---------|
| 0 | 1 | PROCEDURE | biopsy | SYMPTOM | lesion |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|re_test_problem_finding|
|Type:|re|
|Compatibility:|Healthcare NLP 2.7.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]|
|Output Labels:|[relations]|
|Language:|en|
## Data Source
Trained on internal datasets.
---
layout: model
title: English BertForQuestionAnswering Cased model (from maroo93)
author: John Snow Labs
name: bert_qa_kd_squad1.1
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `kd_squad1.1` is an English model originally trained by `maroo93`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_kd_squad1.1_en_4.0.0_3.0_1657189570964.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_kd_squad1.1_en_4.0.0_3.0_1657189570964.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kd_squad1.1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kd_squad1.1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_kd_squad1.1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|249.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/maroo93/kd_squad1.1
---
layout: model
title: Chinese Part of Speech Tagger (from raynardj)
author: John Snow Labs
name: bert_pos_classical_chinese_punctuation_guwen_biaodian
date: 2022-05-09
tags: [bert, pos, part_of_speech, zh, open_source]
task: Part of Speech Tagging
language: zh
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained part-of-speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `classical-chinese-punctuation-guwen-biaodian` is a Chinese model originally trained by `raynardj`.
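Since the model's maximum sentence length is 128 (see the model information table below), very long unpunctuated classical passages may need to be split before entering the pipeline. A minimal character-window sketch (the window size is illustrative; classical Chinese has no whitespace, so characters stand in for tokens here):

```python
def split_passage(text, window=100):
    """Split a long string into fixed-size character windows so each piece
    stays under the model's length limit. Illustrative helper, not a Spark NLP API."""
    return [text[i:i + window] for i in range(0, len(text), window)]

chunks = split_passage("天地玄黄宇宙洪荒" * 40, window=100)
print(len(chunks))  # 320 characters -> 4 windows
```

Each chunk can then be placed in its own row of the `text` column used in the examples below.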
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_classical_chinese_punctuation_guwen_biaodian_zh_3.4.2_3.0_1652088290238.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_classical_chinese_punctuation_guwen_biaodian_zh_3.4.2_3.0_1652088290238.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_pos_classical_chinese_punctuation_guwen_biaodian","zh") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_classical_chinese_punctuation_guwen_biaodian","zh")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_pos_classical_chinese_punctuation_guwen_biaodian|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|zh|
|Size:|381.7 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/raynardj/classical-chinese-punctuation-guwen-biaodian
- https://github.com/raynardj/yuan
---
layout: model
title: English RobertaForQuestionAnswering (from sunitha)
author: John Snow Labs
name: roberta_qa_Roberta_Custom_Squad_DS
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Roberta_Custom_Squad_DS` is an English model originally trained by `sunitha`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_Roberta_Custom_Squad_DS_en_4.0.0_3.0_1655727273046.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_Roberta_Custom_Squad_DS_en_4.0.0_3.0_1655727273046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_Roberta_Custom_Squad_DS","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_Roberta_Custom_Squad_DS","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.by_sunitha").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_Roberta_Custom_Squad_DS|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/sunitha/Roberta_Custom_Squad_DS
---
layout: model
title: Ukrainian BertForMaskedLM Base Cased model (from Geotrend)
author: John Snow Labs
name: bert_embeddings_base_uk_cased
date: 2022-12-02
tags: [uk, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: uk
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uk-cased` is a Ukrainian model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_uk_cased_uk_4.2.4_3.0_1670019147754.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_uk_cased_uk_4.2.4_3.0_1670019147754.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_uk_cased","uk") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_uk_cased","uk")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_uk_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|uk|
|Size:|357.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/bert-base-uk-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Translate English to San Salvador Kongo Pipeline
author: John Snow Labs
name: translate_en_kwy
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, kwy, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `kwy`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_kwy_xx_2.7.0_2.4_1609688437952.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_kwy_xx_2.7.0_2.4_1609688437952.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_kwy", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_kwy", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.kwy').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_kwy|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Greek BertForQuestionAnswering Cased model (from Danastos)
author: John Snow Labs
name: bert_qa_qacombination_el_4
date: 2022-07-07
tags: [el, open_source, bert, question_answering]
task: Question Answering
language: el
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `qacombination_bert_el_4` is a Greek model originally trained by `Danastos`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_qacombination_el_4_el_4.0.0_3.0_1657190786453.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_qacombination_el_4_el_4.0.0_3.0_1657190786453.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_qacombination_el_4","el") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_qacombination_el_4","el")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_qacombination_el_4|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|el|
|Size:|421.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Danastos/qacombination_bert_el_4
---
layout: model
title: Company Name Normalization using Nasdaq Stock Screener
author: John Snow Labs
name: finel_nasdaq_company_name_stock_screener
date: 2023-01-20
tags: [en, finance, licensed, nasdaq, company]
task: Entity Resolution
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a Financial Entity Resolver model, trained to return the normalized version of a company name as registered in the NASDAQ Stock Screener. You can use this model after extracting a company name with any NER model, and you will obtain the official name of the company as listed in the NASDAQ Stock Screener.
After that, you can use `finmapper_nasdaq_company_name_stock_screener` to augment the result with more information about the company from the NASDAQ Stock Screener, including its ticker, sector, country, etc.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finel_nasdaq_company_name_stock_screener_en_1.0.0_3.0_1674233034536.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finel_nasdaq_company_name_stock_screener_en_1.0.0_3.0_1674233034536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
.setInputCols(["document", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["document","token","ner"])\
.setOutputCol("ner_chunk")
chunkToDoc = nlp.Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
chunk_embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
.setInputCols("ner_chunk_doc") \
.setOutputCol("sentence_embeddings")
use_er_model = finance.SentenceEntityResolverModel.pretrained("finel_nasdaq_company_name_stock_screener", "en", "finance/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("normalized")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
tokenizer,
embeddings,
ner_model,
ner_converter,
chunkToDoc,
chunk_embeddings,
use_er_model
])
text = """NIKE is an American multinational corporation that is engaged in the design, development, manufacturing, and worldwide marketing and sales of footwear, apparel, equipment, accessories, and services."""
test_data = spark.createDataFrame([[text]]).toDF("text")
model = nlpPipeline.fit(test_data)
lp = nlp.LightPipeline(model)
result = lp.annotate(text)
result["normalized"]
```
## Results
```bash
['Nike Inc. Common Stock']
```
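Conceptually, a `SentenceEntityResolverModel` with `setDistanceFunction("EUCLIDEAN")` returns the catalog entry whose sentence embedding lies closest to the embedding of the input chunk. A toy sketch of that nearest-neighbor lookup (the 3-dimensional vectors and catalog below are invented stand-ins for real embeddings, not part of the model):

```python
import math

def resolve(chunk_vec, catalog):
    """Return the catalog name whose embedding has the smallest
    Euclidean distance to the input chunk embedding."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(catalog, key=lambda name: dist(chunk_vec, catalog[name]))

# Toy 3-d "embeddings" standing in for Universal Sentence Encoder output.
catalog = {
    "Nike Inc. Common Stock":  [0.9, 0.1, 0.0],
    "Apple Inc. Common Stock": [0.1, 0.9, 0.0],
}
print(resolve([0.8, 0.2, 0.05], catalog))  # closest catalog entry wins
```

In the real pipeline, the catalog embeddings are baked into the pretrained resolver and the chunk embedding comes from the `tfhub_use` stage.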
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finel_nasdaq_company_name_stock_screener|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[normalized]|
|Language:|en|
|Size:|54.7 MB|
|Case sensitive:|false|
## References
https://www.nasdaq.com/market-activity/stocks/screener
---
layout: model
title: Self Reported Stress Classifier (BioBERT)
author: John Snow Labs
name: bert_sequence_classifier_self_reported_stress_tweet
date: 2022-07-29
tags: [en, licenced, clinical, public_health, sequence_classification, classifier, stress, licensed]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
recommended: true
annotator: MedicalBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier that can identify stress in self-disclosure posts on social media (Twitter). The model determines whether a person states that they are stressed or not.
## Predicted Entities
`not-stressed`, `stressed`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_STRESS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_self_reported_stress_tweet_en_4.0.0_3.0_1659087442993.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_self_reported_stress_tweet_en_4.0.0_3.0_1659087442993.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_self_reported_stress_tweet", "en", "clinical/models")\
.setInputCols(["document",'token'])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
data = spark.createDataFrame([["Do you feel stressed?"],
["I'm so stressed!"],
["Depression and anxiety will probably end up killing me – I feel so stressed all the time and just feel awful."],
["Do you enjoy living constantly in this self-inflicted stress?"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("text", "class.result").show(truncate=False)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_self_reported_stress_tweet", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))
val data = Seq("Do you feel stressed?",
"I'm so stressed!",
"Depression and anxiety will probably end up killing me – I feel so stressed all the time and just feel awful.",
"Do you enjoy living constantly in this self-inflicted stress?").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.self_reported_stress").predict("""Depression and anxiety will probably end up killing me – I feel so stressed all the time and just feel awful.""")
```
## Results
```bash
+-------------------------------------------------------------------------------------------------------------+--------------+
|text |result |
+-------------------------------------------------------------------------------------------------------------+--------------+
|Do you feel stressed? |[not-stressed]|
|I'm so stressed! |[stressed] |
|Depression and anxiety will probably end up killing me – I feel so stressed all the time and just feel awful.|[stressed] |
|Do you enjoy living constantly in this self-inflicted stress? |[not-stressed]|
+-------------------------------------------------------------------------------------------------------------+--------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_self_reported_stress_tweet|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## Benchmarking
```bash
label precision recall f1-score support
not-stressed 0.8564 0.8020 0.8283 409
stressed 0.7197 0.7909 0.7536 263
accuracy - - 0.7976 672
macro-avg 0.7881 0.7964 0.7910 672
weighted-avg 0.8029 0.7976 0.7991 672
```
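For reference, the macro average in the table above is the unweighted mean of the per-label scores, while the weighted average weights each label by its support. A quick sanity check of the reported f1 values (small differences are expected, since the per-label scores are already rounded):

```python
# Per-label f1 and support, copied from the benchmarking table above.
support = {"not-stressed": 409, "stressed": 263}
f1 = {"not-stressed": 0.8283, "stressed": 0.7536}

# Macro average: plain mean over labels.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each label weighted by its support.
total = sum(support.values())
weighted_f1 = sum(f1[label] * support[label] for label in f1) / total

print(round(macro_f1, 4))     # ≈ 0.7910 (table: macro-avg f1)
print(round(weighted_f1, 4))  # ≈ 0.7991 (table: weighted-avg f1)
```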
---
layout: model
title: English asr_wav2vec2_large_960h_lv60_self_4_gram TFWav2Vec2ForCTC from patrickvonplaten
author: John Snow Labs
name: asr_wav2vec2_large_960h_lv60_self_4_gram
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h_lv60_self_4_gram` is an English model originally trained by patrickvonplaten.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_960h_lv60_self_4_gram_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_lv60_self_4_gram_en_4.2.0_3.0_1664021695681.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_lv60_self_4_gram_en_4.2.0_3.0_1664021695681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_960h_lv60_self_4_gram", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_960h_lv60_self_4_gram", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
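As a side note on how the annotator's output text is produced: Wav2Vec2ForCTC emits one character distribution per audio frame, and greedy CTC decoding collapses repeated characters before removing the blank token. A toy sketch of that collapse step (the frame sequence and `_` blank symbol below are invented for illustration):

```python
BLANK = "_"  # stand-in for the CTC blank token

def ctc_greedy_collapse(frames):
    """Collapse consecutive duplicate characters, then drop CTC blanks."""
    out = []
    prev = None
    for ch in frames:
        if ch != prev and ch != BLANK:  # keep only the first of each run
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_greedy_collapse(list("hh_eee_ll_lll_oo")))  # -> "hello"
```

Note how the blank between the two `l` runs is what preserves the double letter in "hello".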
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_960h_lv60_self_4_gram|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|757.4 MB|
---
layout: model
title: Turkish XlmRoBertaForQuestionAnswering (from Aybars)
author: John Snow Labs
name: xlm_roberta_qa_XLM_Turkish
date: 2022-06-23
tags: [tr, open_source, question_answering, xlmroberta]
task: Question Answering
language: tr
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `XLM_Turkish` is a Turkish model originally trained by `Aybars`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_XLM_Turkish_tr_4.0.0_3.0_1655983903393.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_XLM_Turkish_tr_4.0.0_3.0_1655983903393.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_XLM_Turkish","tr") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_XLM_Turkish","tr")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("tr.answer_question.xlm_roberta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_XLM_Turkish|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|tr|
|Size:|792.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Aybars/XLM_Turkish
---
layout: model
title: English asr_wav2vec2_coral_300ep TFWav2Vec2ForCTC from joaoalvarenga
author: John Snow Labs
name: pipeline_asr_wav2vec2_coral_300ep
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_coral_300ep` is an English model originally trained by joaoalvarenga.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_coral_300ep_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_coral_300ep_en_4.2.0_3.0_1664023766656.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_coral_300ep_en_4.2.0_3.0_1664023766656.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_coral_300ep', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_coral_300ep", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_coral_300ep|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Pipeline for Detect Subentity PHI for Deidentification (Arabic)
author: John Snow Labs
name: ner_deid_subentity_pipeline
date: 2023-05-31
tags: [licensed, clinical, deidentification, ar, pipeline]
task: Pipeline Healthcare
language: ar
edition: Healthcare NLP 4.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_deid_subentity](https://nlp.johnsnowlabs.com/2023/05/29/ner_deid_subentity_ar.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pipeline_ar_4.4.1_3.0_1685563688023.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pipeline_ar_4.4.1_3.0_1685563688023.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_deid_subentity_pipeline", "ar", "clinical/models")
text= '''
ملاحظات سريرية - مريض الربو. التاريخ: 16 أبريل 2000. اسم المريضة: ليلى حسن. العنوان: شارع المعرفة، مبنى رقم 789، حي الأمانة، جدة. الرمز البريدي: 54321. البلد: المملكة العربية السعودية. اسم المستشفى: مستشفى النور. اسم الطبيب: د. أميرة أحمد. تفاصيل الحالة: المريضة ليلى حسن، البالغة من العمر 35 عامًا، تعاني من مرض الربو المزمن. تشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصها بمرض الربو بناءً على تاريخها الطبي واختبارات وظائف الرئة. الخطة: تم وصف مضادات الالتهاب غير الستيرويدية والموسعات القصبية لتحسين التنفس وتقليل التهيج. يجب على المريضة حمل معها جهاز الاستنشاق في حالة حدوث نوبة ربو حادة. يتعين على المريضة تجنب التحسس من العوامل المسببة للربو، مثل الدخان والغبار والحيوانات الأليفة. يجب مراقبة وظائف الرئة بانتظام ومتابعة التعليمات الطبية المتعلقة بمرض الربو. تعليم المريضة بشأن كيفية استخدام جهاز الاستنشاق بشكل صحيح وتقنيات التنفس الصحيح.
'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_deid_subentity_pipeline", "ar", "clinical/models")
val text = "ملاحظات سريرية - مريض الربو. التاريخ: 16 أبريل 2000. اسم المريضة: ليلى حسن. العنوان: شارع المعرفة، مبنى رقم 789، حي الأمانة، جدة. الرمز البريدي: 54321. البلد: المملكة العربية السعودية. اسم المستشفى: مستشفى النور. اسم الطبيب: د. أميرة أحمد. تفاصيل الحالة: المريضة ليلى حسن، البالغة من العمر 35 عامًا، تعاني من مرض الربو المزمن. تشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصها بمرض الربو بناءً على تاريخها الطبي واختبارات وظائف الرئة. الخطة: تم وصف مضادات الالتهاب غير الستيرويدية والموسعات القصبية لتحسين التنفس وتقليل التهيج. يجب على المريضة حمل معها جهاز الاستنشاق في حالة حدوث نوبة ربو حادة. يتعين على المريضة تجنب التحسس من العوامل المسببة للربو، مثل الدخان والغبار والحيوانات الأليفة. يجب مراقبة وظائف الرئة بانتظام ومتابعة التعليمات الطبية المتعلقة بمرض الربو. تعليم المريضة بشأن كيفية استخدام جهاز الاستنشاق بشكل صحيح وتقنيات التنفس الصحيح."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
+---------------+--------+
|chunks |entities|
+---------------+--------+
|16 أبريل 2000 |DATE |
|ليلى حسن |PATIENT |
|789، |ZIP |
|جدة |CITY |
|54321 |ZIP |
|المملكة العربية|CITY |
|السعودية |COUNTRY |
|النور |HOSPITAL|
|أميرة أحمد |DOCTOR |
|ليلى |PATIENT |
|35 |AGE |
+---------------+--------+
```
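A typical next step after this pipeline is to mask each detected PHI chunk in the text with its entity label. A toy sketch of that masking using two of the chunks above (the `mask_phi` helper is a hypothetical illustration, not part of the pipeline):

```python
def mask_phi(text, chunks):
    """Replace each detected (chunk, label) pair with a <LABEL> placeholder."""
    for chunk, label in chunks:
        text = text.replace(chunk, f"<{label}>")
    return text

# Two of the (chunk, entity) rows from the results table above.
chunks = [("ليلى حسن", "PATIENT"), ("جدة", "CITY")]
print(mask_phi("اسم المريضة: ليلى حسن. المدينة: جدة.", chunks))
# -> اسم المريضة: <PATIENT>. المدينة: <CITY>.
```

In production, Healthcare NLP's dedicated deidentification annotators perform this step with proper offset handling rather than string replacement.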
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_subentity_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|ar|
|Size:|1.2 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: English XLMRobertaForTokenClassification Base Cased model (from tner)
author: John Snow Labs
name: xlmroberta_ner_base_bc5cdr
date: 2022-08-13
tags: [en, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-bc5cdr` is an English model originally trained by `tner`.
## Predicted Entities
`chemical`, `disease`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_bc5cdr_en_4.1.0_3.0_1660425851127.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_bc5cdr_en_4.1.0_3.0_1660425851127.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_bc5cdr","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_bc5cdr","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_bc5cdr|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|780.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/tner/xlm-roberta-base-bc5cdr
- https://github.com/asahi417/tner
---
layout: model
title: Legal Corporate existence Clause Binary Classifier
author: John Snow Labs
name: legclf_corporate_existence_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `corporate-existence` clause type. To use it, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at the sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting them using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
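The first technique above, paragraph splitting by multiline, can be sketched in plain Python. This is an illustrative helper (`split_paragraphs` is not a Spark NLP API), showing the idea of cutting a document on runs of blank lines before classification:

```python
import re

def split_paragraphs(text: str):
    """Split a document into paragraphs on runs of blank lines."""
    # Trim the document, then split wherever one or more empty lines occur.
    parts = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in parts if p.strip()]

doc = "Clause 1. The Company is duly organized.\n\nClause 2. Other text.\n\n\nClause 3. More."
print(split_paragraphs(doc))
# → ['Clause 1. The Company is duly organized.', 'Clause 2. Other text.', 'Clause 3. More.']
```

Each returned paragraph can then be fed to the classifier as a separate row.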
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you add.
## Predicted Entities
`other`, `corporate-existence`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_corporate_existence_clause_en_1.0.0_3.2_1660123366726.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_corporate_existence_clause_en_1.0.0_3.2_1660123366726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+---------------------+
|result               |
+---------------------+
|[corporate-existence]|
|[other]              |
|[other]              |
|[corporate-existence]|
+---------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_corporate_existence_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
corporate-existence 0.91 0.93 0.92 43
other 0.96 0.95 0.95 76
accuracy - - 0.94 119
macro-avg 0.93 0.94 0.94 119
weighted-avg 0.94 0.94 0.94 119
```
---
layout: model
title: Bangla BertForMaskedLM Cased model (from neuralspace-reverie)
author: John Snow Labs
name: bert_embeddings_indic_transformers
date: 2022-12-06
tags: [bn, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: bn
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-bn-bert` is a Bangla model originally trained by `neuralspace-reverie`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_bn_4.2.4_3.0_1670326563595.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_bn_4.2.4_3.0_1670326563595.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","bn") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","bn")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_indic_transformers|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|bn|
|Size:|505.7 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/neuralspace-reverie/indic-transformers-bn-bert
- https://oscar-corpus.com/
---
layout: model
title: Extract Cancer Therapies and Granular Posology Information
author: John Snow Labs
name: ner_oncology_posology
date: 2022-11-24
tags: [licensed, clinical, en, oncology, ner, treatment, posology]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts cancer therapies (Cancer_Surgery, Radiotherapy and Cancer_Therapy) and posology information at a granular level.
Definitions of Predicted Entities:
- `Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment.
- `Cancer_Therapy`: Any cancer treatment mentioned in text, excluding surgeries and radiotherapy.
- `Cycle_Count`: The total number of cycles being administered of an oncological therapy (e.g. "5 cycles").
- `Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. "day 5").
- `Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. "third cycle").
- `Dosage`: The quantity prescribed by the physician for an active ingredient.
- `Duration`: Words indicating the duration of a treatment (e.g. "for 2 weeks").
- `Frequency`: Words indicating the frequency of treatment administration (e.g. "daily" or "bid").
- `Radiotherapy`: Terms that indicate the use of Radiotherapy.
- `Radiation_Dose`: Dose used in radiotherapy.
- `Route`: Words indicating the type of administration route (such as "PO" or "transdermal").
## Predicted Entities
`Cancer_Surgery`, `Cancer_Therapy`, `Cycle_Count`, `Cycle_Day`, `Cycle_Number`, `Dosage`, `Duration`, `Frequency`, `Radiotherapy`, `Radiation_Dose`, `Route`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_posology_en_4.2.2_3.0_1669306988706.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_posology_en_4.2.2_3.0_1669306988706.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_posology", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_posology", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_posology").predict("""The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition.""")
```
## Results
```bash
| chunk | ner_label |
|:-----------------|:---------------|
| adriamycin | Cancer_Therapy |
| 60 mg/m2 | Dosage |
| cyclophosphamide | Cancer_Therapy |
| 600 mg/m2 | Dosage |
| six courses | Cycle_Count |
| second cycle | Cycle_Number |
| chemotherapy | Cancer_Therapy |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_posology|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|34.3 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Cycle_Number 52 4 45 97 0.93 0.54 0.68
Cycle_Count 200 63 30 230 0.76 0.87 0.81
Radiotherapy 255 16 55 310 0.94 0.82 0.88
Cancer_Surgery 592 66 227 819 0.90 0.72 0.80
Cycle_Day 175 22 73 248 0.89 0.71 0.79
Frequency 337 44 90 427 0.88 0.79 0.83
Route 53 1 60 113 0.98 0.47 0.63
Cancer_Therapy 1448 81 250 1698 0.95 0.85 0.90
Duration 525 154 236 761 0.77 0.69 0.73
Dosage 858 79 202 1060 0.92 0.81 0.86
Radiation_Dose 86 4 40 126 0.96 0.68 0.80
macro_avg 4581 534 1308 5889 0.90 0.72 0.79
micro_avg 4581 534 1308 5889 0.90 0.78 0.83
```
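The per-label scores in the table above follow directly from the tp/fp/fn counts. A minimal sketch of that arithmetic (the helper name `prf` is illustrative), checked against the `Cancer_Therapy` row:

```python
def prf(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# Cancer_Therapy row: tp=1448, fp=81, fn=250
print(prf(1448, 81, 250))  # → (0.95, 0.85, 0.9), matching the table
```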
---
layout: model
title: Indonesian T5ForConditionalGeneration Base Cased model (from Wikidepia)
author: John Snow Labs
name: t5_indot5_base_paraphrase
date: 2023-01-30
tags: [id, open_source, t5]
task: Text Generation
language: id
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `IndoT5-base-paraphrase` is an Indonesian model originally trained by `Wikidepia`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_indot5_base_paraphrase_id_4.3.0_3.0_1675097776595.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_indot5_base_paraphrase_id_4.3.0_3.0_1675097776595.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_indot5_base_paraphrase","id") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_indot5_base_paraphrase","id")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_indot5_base_paraphrase|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|id|
|Size:|1.0 GB|
## References
- https://huggingface.co/Wikidepia/IndoT5-base-paraphrase
---
layout: model
title: OCR small for printed text
author: John Snow Labs
name: ocr_small_printed
date: 2022-02-16
tags: [en, licensed]
task: OCR Text Detection & Recognition
language: en
nav_key: models
edition: Visual NLP 3.3.3
spark_version: 2.4
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
OCR small model for recognizing printed text, based on the TrOCR architecture. The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition (OCR). The abstract from the paper is the following: Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/ocr_small_printed_en_3.3.3_2.4_1645007455031.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/ocr_small_printed_en_3.3.3_2.4_1645007455031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
ocr = ImageToTextv2().pretrained("ocr_small_printed", "en", "clinical/ocr")
ocr.setInputCols(["image"])
ocr.setOutputCol("text")
result = ocr.transform(image_text_lines_df).collect()
print(result[0].text)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ocr_small_printed|
|Type:|ocr|
|Compatibility:|Visual NLP 3.3.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|146.7 MB|
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab9 TFWav2Vec2ForCTC from hassnain
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab9
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab9` is an English model originally trained by hassnain.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab9_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab9_en_4.2.0_3.0_1664019939623.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab9_en_4.2.0_3.0_1664019939623.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_timit_demo_colab9", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_timit_demo_colab9", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
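The `audio_content` column consumed by `AudioAssembler` holds raw audio samples as floats. A stdlib-only sketch of decoding 16-bit mono PCM WAV bytes into that form (the helper name `wav_to_floats`, the 16-bit mono assumption, and the [-1, 1] normalization are illustrative, not a Spark NLP API):

```python
import io
import math
import struct
import wave

def wav_to_floats(wav_bytes: bytes):
    """Decode 16-bit mono PCM WAV bytes into floats in [-1.0, 1.0]."""
    with wave.open(io.BytesIO(wav_bytes)) as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny 16 kHz sine tone in memory to demonstrate the round trip.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    pcm = [int(0.5 * 32767 * math.sin(2 * math.pi * 440 * t / 16000)) for t in range(160)]
    w.writeframes(struct.pack("<160h", *pcm))

floats = wav_to_floats(buf.getvalue())
print(len(floats))  # → 160
```

The resulting float list can be placed in the DataFrame column passed to `AudioAssembler.setInputCol("audio_content")`.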
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab9|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|
---
layout: model
title: Spanish RobertaForQuestionAnswering (from jamarju)
author: John Snow Labs
name: roberta_qa_roberta_base_bne_squad_2.0_es_jamarju
date: 2022-06-21
tags: [es, open_source, question_answering, roberta]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-squad-2.0-es` is a Spanish model originally trained by `jamarju`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_bne_squad_2.0_es_jamarju_es_4.0.0_3.0_1655789380928.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_bne_squad_2.0_es_jamarju_es_4.0.0_3.0_1655789380928.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_bne_squad_2.0_es_jamarju","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_bne_squad_2.0_es_jamarju","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.squad.roberta.base.by_jamarju").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_bne_squad_2.0_es_jamarju|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|456.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/jamarju/roberta-base-bne-squad-2.0-es
- https://github.com/PlanTL-SANIDAD/lm-spanish
- https://github.com/ccasimiro88/TranslateAlignRetrieve
---
layout: model
title: English DistilBertForQuestionAnswering model (from Hoang)
author: John Snow Labs
name: distilbert_qa_Hoang_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Hoang`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Hoang_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724211596.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Hoang_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724211596.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Hoang_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Hoang_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Hoang").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_Hoang_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Hoang/distilbert-base-uncased-finetuned-squad
---
layout: model
title: ICD10GM ChunkResolver
author: John Snow Labs
name: chunkresolve_ICD10GM
class: ChunkEntityResolverModel
language: de
repository: clinical/models
date: 2020-09-06
task: Entity Resolution
edition: Healthcare NLP 2.5.5
spark_version: 2.4
tags: [clinical,licensed,entity_resolution,de]
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Entity Resolution model based on KNN using Word Embeddings + Word Movers Distance.
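The KNN resolution step can be sketched in plain Python: rank candidate codes by distance from the chunk's embedding and keep the nearest k. This toy sketch uses Euclidean distance (as set via `setDistanceFunction("EUCLIDEAN")` in the example below); the helper name, codes, and 2-d vectors are illustrative, and the real model operates on clinical word embeddings:

```python
import math

def knn_resolve(query, catalog, k=5):
    """Rank (code, vector) catalog entries by Euclidean distance to the query vector."""
    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, v)))
    # Nearest entries first; keep the top k neighbours.
    return sorted(catalog, key=lambda item: dist(item[1]))[:k]

catalog = [("J44.9", [1.0, 0.0]), ("L70.0", [0.0, 1.0]), ("K40.2", [0.9, 0.1])]
print(knn_resolve([1.0, 0.0], catalog, k=2))  # → [('J44.9', [1.0, 0.0]), ('K40.2', [0.9, 0.1])]
```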
## Predicted Entities
Codes and their normalized definition with `clinical_embeddings`.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/14.German_Healthcare_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_ICD10GM_de_2.5.5_2.4_1599431635423.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_ICD10GM_de_2.5.5_2.4_1599431635423.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPython.html %}
```python
...
icd10_resolution = ChunkEntityResolverModel.pretrained("chunkresolve_ICD10GM",'de','clinical/models') \
.setInputCols(["token", "chunk_embeddings"]) \
.setOutputCol("icd10_de_code")\
.setDistanceFunction("EUCLIDEAN") \
.setNeighbours(5)
pipeline_icd10 = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, de_embeddings, de_ner, ner_converter, chunk_embeddings, icd10_resolution])
empty_data = spark.createDataFrame([['''Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist Hernia femoralis, Akne, einseitig, ein hochmalignes bronchogenes Karzinom, das überwiegend im Zentrum der Lunge, in einem Hauptbronchus entsteht. Die mittlere Prävalenz wird auf 1/20.000 geschätzt. Vom SCLC sind hauptsächlich Peronen mittleren Alters (27-66 Jahre) mit Raucheranamnese betroffen. Etwa 70% der Patienten mit SCLC haben bei Stellung der Diagnose schon extra-thorakale Symptome. Zu den Symptomen gehören Thoraxschmerz, Dyspnoe, Husten und pfeifende Atmung. Die Beteiligung benachbarter Bereiche verursacht Heiserkeit, Dysphagie und Oberes Vena-cava-Syndrom (Obstruktion des Blutflusses durch die Vena cava superior). Zusätzliche Symptome als Folge einer Fernmetastasierung sind ebenfalls möglich. Rauchen und Strahlenexposition sind synergistisch wirkende Risikofaktoren. Die industrielle Exposition mit Bis (Chlormethyläther) ist ein weiterer Risikofaktor. Röntgenaufnahmen des Thorax sind nicht ausreichend empfindlich, um einen SCLC frühzeitig zu erkennen. Röntgenologischen Auffälligkeiten muß weiter nachgegangen werden, meist mit Computertomographie. Die Diagnose wird bioptisch gesichert. Patienten mit SCLC erhalten meist Bestrahlung und/oder Chemotherapie. In Hinblick auf eine Verbesserung der Überlebenschancen der Patienten ist sowohl bei ausgedehnten und bei begrenzten SCLC eine kombinierte Chemotherapie wirksamer als die Behandlung mit Einzelsubstanzen. Es kann auch eine prophylaktische Bestrahlung des Schädels erwogen werden, da innerhalb von 2-3 Jahren nach Behandlungsbeginn ein hohes Risiko für zentralnervöse Metastasen besteht. Das Kleinzellige Bronchialkarzinom ist der aggressivste Lungentumor: Die 5-Jahres-Überlebensrate beträgt 1-5%, der Median des gesamten Überlebens liegt bei etwa 6 bis 10 Monaten.''']]).toDF("text")
model = pipeline_icd10.fit(empty_data)
results = model.transform(empty_data)
```
```scala
...
val icd10_resolution = ChunkEntityResolverModel.pretrained("chunkresolve_ICD10GM", "de", "clinical/models")
.setInputCols("token", "chunk_embeddings")
.setOutputCol("icd10_de_code")
.setDistanceFunction("EUCLIDEAN")
.setNeighbours(5)
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, de_embeddings, de_ner, ner_converter, chunk_embeddings, icd10_resolution))
val data = Seq("Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist Hernia femoralis, Akne, einseitig, ein hochmalignes bronchogenes Karzinom, das überwiegend im Zentrum der Lunge, in einem Hauptbronchus entsteht. Die mittlere Prävalenz wird auf 1/20.000 geschätzt. Vom SCLC sind hauptsächlich Peronen mittleren Alters (27-66 Jahre) mit Raucheranamnese betroffen. Etwa 70% der Patienten mit SCLC haben bei Stellung der Diagnose schon extra-thorakale Symptome. Zu den Symptomen gehören Thoraxschmerz, Dyspnoe, Husten und pfeifende Atmung. Die Beteiligung benachbarter Bereiche verursacht Heiserkeit, Dysphagie und Oberes Vena-cava-Syndrom (Obstruktion des Blutflusses durch die Vena cava superior). Zusätzliche Symptome als Folge einer Fernmetastasierung sind ebenfalls möglich. Rauchen und Strahlenexposition sind synergistisch wirkende Risikofaktoren. Die industrielle Exposition mit Bis (Chlormethyläther) ist ein weiterer Risikofaktor. Röntgenaufnahmen des Thorax sind nicht ausreichend empfindlich, um einen SCLC frühzeitig zu erkennen. Röntgenologischen Auffälligkeiten muß weiter nachgegangen werden, meist mit Computertomographie. Die Diagnose wird bioptisch gesichert. Patienten mit SCLC erhalten meist Bestrahlung und/oder Chemotherapie. In Hinblick auf eine Verbesserung der Überlebenschancen der Patienten ist sowohl bei ausgedehnten und bei begrenzten SCLC eine kombinierte Chemotherapie wirksamer als die Behandlung mit Einzelsubstanzen. Es kann auch eine prophylaktische Bestrahlung des Schädels erwogen werden, da innerhalb von 2-3 Jahren nach Behandlungsbeginn ein hohes Risiko für zentralnervöse Metastasen besteht. Das Kleinzellige Bronchialkarzinom ist der aggressivste Lungentumor: Die 5-Jahres-Überlebensrate beträgt 1-5%, der Median des gesamten Überlebens liegt bei etwa 6 bis 10 Monaten.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
```bash
| Problem | ICD10-Code |
|--------------------|------------|
| Kleinzellige | M01.00 |
| Bronchialkarzinom | I50.0 |
| Kleinzelliger | I37.0 |
| Lungenkrebs | B90.9 |
| SCLC | C83.0 |
| ... | ... |
| Kleinzellige | M01.00 |
| Bronchialkarzinom | I50.0 |
| Lungentumor | C90.31 |
| 1-5% | I37.0 |
| 6 bis 10 Monaten | Q91.6 |
```
{:.model-param}
## Model Information
{:.table-model}
|----------------|--------------------------|
| Name: | chunkresolve_ICD10GM |
| Type: | ChunkEntityResolverModel |
| Compatibility: | Spark NLP 2.5.5+ |
| License: | Licensed |
|Edition:|Official|
|Input labels: | [token, chunk_embeddings] |
|Output labels: | [entity] |
| Language: | de |
| Case sensitive: | True |
| Dependencies: | w2v_cc_300d |
{:.h2_title}
## Data Source
FILLUP
---
layout: model
title: Onto Recognize Entities Lg
author: John Snow Labs
name: onto_recognize_entities_lg
date: 2022-06-28
tags: [en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The onto_recognize_entities_lg is a pretrained pipeline that can be used to process text with a simple pipeline that performs basic processing steps and recognizes entities.
It performs most of the common text processing tasks on your dataframe.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_lg_en_4.0.0_3.0_1656389642706.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_lg_en_4.0.0_3.0_1656389642706.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("onto_recognize_entities_lg", "en")
result = pipeline.annotate("""I love johnsnowlabs! """)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.ner.onto.lg").predict("""I love johnsnowlabs! """)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|onto_recognize_entities_lg|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|2.5 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- NerDLModel
- NerConverter
---
layout: model
title: Translate Hausa to English Pipeline
author: John Snow Labs
name: translate_ha_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, ha, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `ha`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ha_en_xx_2.7.0_2.4_1609686549663.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ha_en_xx_2.7.0_2.4_1609686549663.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_ha_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_ha_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.ha.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_ha_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Pipeline to Detect Cancer Genetics (BertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_ner_bionlp_pipeline
date: 2023-03-20
tags: [bertfortokenclassification, ner, bionlp, en, licensed]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [bert_token_classifier_ner_bionlp](https://nlp.johnsnowlabs.com/2022/01/03/bert_token_classifier_ner_bionlp_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bionlp_pipeline_en_4.3.0_3.2_1679308593451.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bionlp_pipeline_en_4.3.0_3.2_1679308593451.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_ner_bionlp_pipeline", "en", "clinical/models")
text = '''Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_ner_bionlp_pipeline", "en", "clinical/models")
val text = "Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.biolp.pipeline").predict("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:--------------------|--------:|------:|:-----------------------|-------------:|
| 0 | erbA IRES | 9 | 17 | Organism | 0.999188 |
| 1 | erbA/myb virus | 27 | 40 | Organism | 0.999434 |
| 2 | erythroid cells | 65 | 79 | Cell | 0.999837 |
| 3 | bone | 100 | 103 | Multi-tissue_structure | 0.999846 |
| 4 | marrow | 105 | 110 | Multi-tissue_structure | 0.999876 |
| 5 | blastoderm cultures | 115 | 133 | Cell | 0.999823 |
| 6 | erbA/myb IRES virus | 140 | 158 | Organism | 0.999751 |
| 7 | erbA IRES virus | 236 | 250 | Organism | 0.999749 |
| 8 | blastoderm | 259 | 268 | Cell | 0.999897 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_bionlp_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|405.0 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverterInternalModel
---
layout: model
title: Fast Neural Machine Translation Model from English to Niger-Kordofanian Languages
author: John Snow Labs
name: opus_mt_en_nic
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, nic, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `en`
- target languages: `nic`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_nic_xx_2.7.0_2.4_1609167803723.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_nic_xx_2.7.0_2.4_1609167803723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_nic", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_nic", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.nic').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_nic|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English T5ForConditionalGeneration Cased model (from google)
author: John Snow Labs
name: t5_efficient_xl_nl4
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-xl-nl4` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_xl_nl4_en_4.3.0_3.0_1675124613893.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_xl_nl4_en_4.3.0_3.0_1675124613893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_xl_nl4","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_xl_nl4","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_xl_nl4|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|1.0 GB|
## References
- https://huggingface.co/google/t5-efficient-xl-nl4
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Legal Quiet Enjoyment Clause Binary Classifier
author: John Snow Labs
name: legclf_quiet_enjoyment_clause
date: 2023-01-29
tags: [en, legal, classification, quiet, enjoyment, clauses, quiet_enjoyment, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `quiet-enjoyment` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial linked above).
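The paragraph-splitting option above can be sketched in plain Python; the `split_paragraphs` helper below is hypothetical and not taken from the workshop notebook:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraph chunks on runs of blank lines,
    so each clause candidate can be classified on its own."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = (
    "Section 1. Quiet Enjoyment.\n"
    "Tenant shall peaceably and quietly hold the premises.\n"
    "\n"
    "Section 2. Rent.\n"
    "Rent is payable monthly in advance."
)
for chunk in split_paragraphs(doc):
    print(chunk.splitlines()[0])  # one header line per clause candidate
```

Each resulting chunk can then be fed to the classifier as an independent row, keeping every input comfortably under the 512-token limit.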
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`quiet-enjoyment`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_quiet_enjoyment_clause_en_1.0.0_3.0_1675005306234.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_quiet_enjoyment_clause_en_1.0.0_3.0_1675005306234.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
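A minimal usage sketch, following the pattern of comparable `legclf_*` cards; note that the `sent_bert_base_cased` embedding stage and the `LegalClassifierDLModel` class name are assumptions based on similar cards, since this card does not name the embeddings the classifier was trained with:

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# NOTE: assumed embedding stage -- the card only states that the
# classifier consumes "sentence_embeddings" of up to 512 tokens.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

docClassifier = LegalClassifierDLModel.pretrained("legclf_quiet_enjoyment_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = nlpPipeline.fit(df).transform(df)
```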
## Results
```bash
+-----------------+
|result           |
+-----------------+
|[quiet-enjoyment]|
|[other]          |
|[other]          |
|[quiet-enjoyment]|
+-----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_quiet_enjoyment_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Legal documents, scraped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.95 1.00 0.97 39
quiet-enjoyment 1.00 0.94 0.97 33
accuracy - - 0.97 72
macro-avg 0.98 0.97 0.97 72
weighted-avg 0.97 0.97 0.97 72
```
---
layout: model
title: Sentence Entity Resolver for ATC (sbiobert_base_cased_mli embeddings)
author: John Snow Labs
name: sbiobertresolve_atc
date: 2022-03-01
tags: [atc, licensed, en, clinical, entity_resolution]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps drug entities to ATC (Anatomic Therapeutic Chemical) codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings.
## Predicted Entities
`ATC Codes`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_atc_en_3.4.1_3.0_1646126349436.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_atc_en_3.4.1_3.0_1646126349436.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("word_embeddings")
posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \
.setInputCols(["sentence", "token", "word_embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(["DRUG"])
c2doc = Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sentence_embeddings")\
.setCaseSensitive(False)
atc_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_atc", "en", "clinical/models")\
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("atc_code")\
.setDistanceFunction("EUCLIDEAN")
resolver_pipeline = Pipeline(
stages = [
document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
posology_ner,
ner_converter,
c2doc,
sbert_embedder,
atc_resolver
])
sampleText = ["""He was seen by the endocrinology service and she was discharged on eltrombopag at night, amlodipine with meals metformin two times a day.""",
"""She was immediately given hydrogen peroxide 30 mg and amoxicillin twice daily for 10 days to treat the infection on her leg. She has a history of taking magnesium hydroxide.""",
"""She was given antidepressant for a month"""]
data = spark.createDataFrame(sampleText, StringType()).toDF("text")
results = resolver_pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("word_embeddings")
val posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("DRUG"))
val c2doc = new Chunk2Doc()
.setInputCols(Array("ner_chunk"))
.setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sentence_embeddings")
.setCaseSensitive(false)
val atc_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_atc", "en", "clinical/models")
.setInputCols(Array("sentence_embeddings"))
.setOutputCol("atc_code")
.setDistanceFunction("EUCLIDEAN")
val resolver_pipeline = new Pipeline().setStages(Array(document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, posology_ner,
ner_converter, c2doc, sbert_embedder, atc_resolver))
val data = Seq("He was seen by the endocrinology service and she was discharged on eltrombopag at night, amlodipine with meals metformin two times a day and then ibuprofen. She was immediately given hydrogen peroxide 30 mg and amoxicillin twice daily for 10 days to treat the infection on her leg. She has a history of taking magnesium hydroxide. She was given antidepressant for a month").toDF("text")
val results = resolver_pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.atc").predict("""She was immediately given hydrogen peroxide 30 mg and amoxicillin twice daily for 10 days to treat the infection on her leg. She has a history of taking magnesium hydroxide.""")
```
## Results
```bash
+-------------------+--------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
| chunk|atc_code| all_k_codes| resolutions| all_k_aux_labels|
+-------------------+--------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
| eltrombopag| B02BX05|B02BX05:::A07DA06:::B06AC03:::M01AB08:::L04AA39...|eltrombopag :::eluxadoline :::ecallantide :::et...|ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th...|
| amlodipine| C08CA01|C08CA01:::C08CA17:::C08CA13:::C08CA06:::C08CA10...|amlodipine :::levamlodipine :::lercanidipine ::...|ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th...|
| metformin| A10BA02|A10BA02:::A10BA01:::A10BB01:::A10BH04:::A10BB07...|metformin :::phenformin :::glyburide / metformi...|ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th...|
| hydrogen peroxide| A01AB02|A01AB02:::S02AA06:::D10AE:::D11AX25:::D10AE01::...|hydrogen peroxide :::hydrogen peroxide; otic:::...|ATC 5th:::ATC 5th:::ATC 4th:::ATC 5th:::ATC 5th...|
| amoxicillin| J01CA04|J01CA04:::J01CA01:::J01CF02:::J01CF01:::J01CA51...|amoxicillin :::ampicillin :::cloxacillin :::dic...|ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th...|
|magnesium hydroxide| A02AA04|A02AA04:::A12CC02:::D10AX30:::B05XA11:::A02AA02...|magnesium hydroxide :::magnesium sulfate :::alu...|ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th...|
| antidepressant| N06A|N06A:::N05A:::N06AX:::N05AH02:::N06D:::N06CA:::...|ANTIDEPRESSANTS:::ANTIPSYCHOTICS:::Other antide...|ATC 3rd:::ATC 3rd:::ATC 4th:::ATC 5th:::ATC 3rd...|
+-------------------+--------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_atc|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[atc_code]|
|Language:|en|
|Size:|71.6 MB|
|Case sensitive:|false|
## References
Trained on ATC 2022 Codes dataset
---
layout: model
title: Catalan Lemmatizer
author: John Snow Labs
name: lemma
date: 2020-07-29 23:34:00 +0800
task: Lemmatization
language: ca
edition: Spark NLP 2.5.5
spark_version: 2.4
tags: [lemmatizer, ca]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_ca_2.5.5_2.4_1596054394549.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_ca_2.5.5_2.4_1596054394549.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
lemmatizer = LemmatizerModel.pretrained("lemma", "ca") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("A part de ser el rei del nord, John Snow és un metge anglès i líder en el desenvolupament de l'anestèsia i la higiene mèdica.")
```
```scala
...
val lemmatizer = LemmatizerModel.pretrained("lemma", "ca")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer))
val data = Seq("A part de ser el rei del nord, John Snow és un metge anglès i líder en el desenvolupament de l'anestèsia i la higiene mèdica.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""A part de ser el rei del nord, John Snow és un metge anglès i líder en el desenvolupament de l'anestèsia i la higiene mèdica."""]
lemma_df = nlu.load('ca.lemma').predict(text, output_level='document')
lemma_df.lemma.values[0]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=0, result='a', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=2, end=5, result='part', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=7, end=8, result='de', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=10, end=12, result='ser', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=14, end=15, result='ell', metadata={'sentence': '0'}, embeddings=[]),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma|
|Type:|lemmatizer|
|Compatibility:|Spark NLP 2.5.5+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[lemma]|
|Language:|ca|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: English BertForQuestionAnswering model (from Nakul24)
author: John Snow Labs
name: bert_qa_Spanbert_emotion_extraction
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Spanbert-emotion-extraction` is an English model originally trained by `Nakul24`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Spanbert_emotion_extraction_en_4.0.0_3.0_1654179065087.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Spanbert_emotion_extraction_en_4.0.0_3.0_1654179065087.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Spanbert_emotion_extraction","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_Spanbert_emotion_extraction","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.span_bert").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_Spanbert_emotion_extraction|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|384.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Nakul24/Spanbert-emotion-extraction
---
layout: model
title: Pipeline to Detect PHI in medical text (biobert)
author: John Snow Labs
name: ner_deid_biobert_pipeline
date: 2023-03-20
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_deid_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_deid_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_biobert_pipeline_en_4.3.0_3.2_1679310594035.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_biobert_pipeline_en_4.3.0_3.2_1679310594035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_deid_biobert_pipeline", "en", "clinical/models")
text = '''A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_deid_biobert_pipeline", "en", "clinical/models")
val text = "A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.deid.ner_biobert.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:------------------------------|--------:|------:|:------------|-------------:|
| 0 | 2093-01-13 | 17 | 26 | DATE | 0.981 |
| 1 | David Hale | 29 | 38 | NAME | 0.77585 |
| 2 | Hendrickson | 53 | 63 | NAME | 0.9666 |
| 3 | Ora | 66 | 68 | LOCATION | 0.8723 |
| 4 | Oliveira | 91 | 98 | LOCATION | 0.7785 |
| 5 | Cocke County Baptist Hospital | 114 | 142 | LOCATION | 0.792 |
| 6 | Keats Street | 150 | 161 | LOCATION | 0.77305 |
| 7 | Phone | 164 | 168 | LOCATION | 0.7083 |
| 8 | Brothers | 253 | 260 | LOCATION | 0.9447 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.2 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: English BertForQuestionAnswering Cased model (from SebastianS)
author: John Snow Labs
name: bert_qa_sebastians_finetuned_squad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `SebastianS`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sebastians_finetuned_squad_en_4.0.0_3.0_1657186249406.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sebastians_finetuned_squad_en_4.0.0_3.0_1657186249406.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sebastians_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sebastians_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_sebastians_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/SebastianS/bert-finetuned-squad
---
layout: model
title: Translate English to Pijin Pipeline
author: John Snow Labs
name: translate_en_pis
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, pis, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `pis`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_pis_xx_2.7.0_2.4_1609698832184.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_pis_xx_2.7.0_2.4_1609698832184.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_pis", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_pis", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.pis').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_pis|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Fast Neural Machine Translation Model from English to Tagalog
author: John Snow Labs
name: opus_mt_en_tl
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, tl, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `tl`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_tl_xx_2.7.0_2.4_1609169442130.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_tl_xx_2.7.0_2.4_1609169442130.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_tl", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_tl", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.tl').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_tl|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Word2Vec Embeddings in Western Frisian (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, fy, open_source]
task: Embeddings
language: fy
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_fy_3.4.1_3.0_1647467525855.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_fy_3.4.1_3.0_1647467525855.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fy") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ik hâld fan spark nlp"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fy")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ik hâld fan spark nlp").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("fy.embed.w2v_cc_300d").predict("""Ik hâld fan spark nlp""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|fy|
|Size:|306.1 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Legal Terms Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_terms_agreement_bert
date: 2022-11-25
tags: [en, legal, classification, agreement, terms, licensed, bert]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_terms_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `terms-agreement` or not (Binary Classification).
Unlike the Longformer model, this model is lighter and faster at inference.
## Predicted Entities
`terms-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_terms_agreement_bert_en_1.0.0_3.0_1669372149301.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_terms_agreement_bert_en_1.0.0_3.0_1669372149301.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
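This card ships without a usage block; below is a minimal Python sketch following the pattern of other Legal NLP classifier cards. The embeddings model name (`sent_bert_base_cased`) is an assumption, chosen because this classifier expects Bert sentence embeddings as input; check the Models Hub entry for the exact companion model.

```python
# Sketch: document -> Bert sentence embeddings -> legal document classifier.
# NOTE: `sent_bert_base_cased` is an assumed embeddings model, not confirmed by this card.
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_terms_agreement_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```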
## Results
```bash
+-----------------+
|           result|
+-----------------+
|[terms-agreement]|
|          [other]|
|          [other]|
|[terms-agreement]|
+-----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_terms_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
other 0.97 0.95 0.96 65
terms-agreement 0.91 0.94 0.93 34
accuracy - - 0.95 99
macro-avg 0.94 0.95 0.94 99
weighted-avg 0.95 0.95 0.95 99
```
---
layout: model
title: Language Detection & Identification Pipeline - 21 Languages (BiGRU)
author: John Snow Labs
name: detect_language_bigru_21
date: 2020-12-05
task: [Pipeline Public, Language Detection, Sentence Detection]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [language_detection, open_source, pipeline, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language from documents with mixed languages by coalescing sentences and selecting the best candidate.
We have designed and developed Deep Learning models using BiGRU architectures (mentioned in the model's name) in TensorFlow/Keras. The models are trained on large datasets such as Wikipedia and Tatoeba with high accuracy evaluated on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias).
This pipeline can detect the following languages:
## Predicted Entities
`Bulgarian`, `Czech`, `Danish`, `German`, `Greek`, `English`, `Estonian`, `Finnish`, `French`, `Hungarian`, `Italian`, `Lithuanian`, `Latvian`, `Dutch`, `Polish`, `Portuguese`, `Romanian`, `Slovak`, `Slovenian`, `Spanish`, `Swedish`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/LANGUAGE_DETECTOR/){:.button.button-orange.button-orange-trans.co.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_bigru_21_xx_2.7.0_2.4_1607186103596.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/detect_language_bigru_21_xx_2.7.0_2.4_1607186103596.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("detect_language_bigru_21", lang = "xx")
pipeline.annotate("French author who helped pioneer the science-fiction genre.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("detect_language_bigru_21", lang = "xx")
pipeline.annotate("French author who helped pioneer the science-fiction genre.")
```
{:.nlu-block}
```python
import nlu
text = ["French author who helped pioneer the science-fiction genre."]
lang_df = nlu.load("xx.classify.lang.bigru").predict(text)
lang_df
```
## Results
```bash
{'document': ['French author who helped pioneer the science-fiction genre.'],
'language': ['en']}
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|detect_language_bigru_21|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- LanguageDetectorDL
---
layout: model
title: Meena's Tapas Table Understanding (Base)
author: John Snow Labs
name: table_qa_table_question_answering_tapas
date: 2022-09-30
tags: [en, table, qa, question, answering, open_source]
task: Table Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: TapasForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a Zero-shot Table Understanding Model which allows you to carry out Question Answering on Spark DataFrames. If you have a file stored in any table format, such as CSV, load it into a DataFrame with Spark first.
Size of this model: Base
Has aggregation operations?: True
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_table_question_answering_tapas_en_4.2.0_3.0_1664530457710.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_table_question_answering_tapas_en_4.2.0_3.0_1664530457710.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
json_data = """
{
"header": ["name", "money", "age"],
"rows": [
["Donald Trump", "$100,000,000", "75"],
["Elon Musk", "$20,000,000,000,000", "55"]
]
}
"""
queries = [
"Who earns less than 200,000,000?",
"Who earns 100,000,000?",
"How much money has Donald Trump?",
"How old are they?",
]
data = spark.createDataFrame([
[json_data, " ".join(queries)]
]).toDF("table_json", "questions")
document_assembler = MultiDocumentAssembler() \
.setInputCols("table_json", "questions") \
.setOutputCols("document_table", "document_questions")
sentence_detector = SentenceDetector() \
.setInputCols(["document_questions"]) \
.setOutputCol("questions")
table_assembler = TableAssembler()\
.setInputCols(["document_table"])\
.setOutputCol("table")
tapas = TapasForQuestionAnswering\
.pretrained("table_qa_table_question_answering_tapas","en")\
.setInputCols(["questions", "table"])\
.setOutputCol("answers")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
table_assembler,
tapas
])
model = pipeline.fit(data)
model\
.transform(data)\
.selectExpr("explode(answers) AS answer")\
.select("answer")\
.show(truncate=False)
```
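The `table_json` payload is plain JSON with `header` and `rows` keys, so it can also be generated programmatically; a small helper (`make_table_json` is a hypothetical name, not part of Spark NLP) shows the shape:

```python
import json

def make_table_json(header, rows):
    """Serialize a header and rows into the {"header": [...], "rows": [[...], ...]}
    JSON layout that the TableAssembler stage parses."""
    return json.dumps({"header": list(header), "rows": [list(r) for r in rows]})

table_json = make_table_json(
    ["name", "money", "age"],
    [["Donald Trump", "$100,000,000", "75"],
     ["Elon Musk", "$20,000,000,000,000", "55"]],
)

# Round-trip check: the payload decodes back into the expected structure.
parsed = json.loads(table_json)
assert parsed["header"] == ["name", "money", "age"]
assert len(parsed["rows"]) == 2
```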
## Results
```bash
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|answer |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} |
|{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} |
|{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} |
|{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|table_qa_table_question_answering_tapas|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|413.9 MB|
|Case sensitive:|false|
## References
https://huggingface.co/models?pipeline_tag=table-question-answering
---
layout: model
title: Legal Scope Clause Binary Classifier
author: John Snow Labs
name: legclf_scope_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `scope` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `scope`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_scope_clause_en_1.0.0_3.2_1660123969333.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_scope_clause_en_1.0.0_3.2_1660123969333.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
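The usage block is missing from this card; a minimal Python sketch in the style of other Legal NLP classifier cards is shown below. The embeddings model name (`sent_bert_base_cased`) is an assumption (this classifier only declares `sentence_embeddings` as input); verify the companion model on Models Hub.

```python
# Sketch: document -> Bert sentence embeddings -> scope-clause binary classifier.
# NOTE: `sent_bert_base_cased` is an assumed embeddings model, not confirmed by this card.
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_scope_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```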
## Results
```bash
+-------+
| result|
+-------+
|[scope]|
|[other]|
|[other]|
|[scope]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_scope_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.94 0.90 0.92 88
scope 0.73 0.83 0.77 29
accuracy - - 0.88 117
macro-avg 0.83 0.86 0.85 117
weighted-avg 0.89 0.88 0.88 117
```
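The macro and weighted averages in the table above follow directly from the per-label rows; a quick plain-Python check (tolerances allow for the table's two-decimal rounding):

```python
# Per-label scores and supports copied from the benchmarking table above.
scores = {
    "other": {"precision": 0.94, "recall": 0.90, "f1": 0.92, "support": 88},
    "scope": {"precision": 0.73, "recall": 0.83, "f1": 0.77, "support": 29},
}
total_support = sum(s["support"] for s in scores.values())

def macro_avg(metric):
    # Unweighted mean over labels.
    return sum(s[metric] for s in scores.values()) / len(scores)

def weighted_avg(metric):
    # Mean weighted by each label's support.
    return sum(s[metric] * s["support"] for s in scores.values()) / total_support

assert total_support == 117
assert abs(macro_avg("f1") - 0.85) <= 0.01        # table: macro-avg f1
assert abs(weighted_avg("f1") - 0.88) <= 0.01     # table: weighted-avg f1
assert abs(weighted_avg("precision") - 0.89) <= 0.01
```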
---
layout: model
title: Legal Signers Clause Binary Classifier (CUAD dataset)
author: John Snow Labs
name: legclf_cuad_signers_clause
date: 2022-11-17
tags: [signers, en, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the signers part of a document. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
There are other models in Models Hub with a similar title; the difference is the dataset they were trained on. This one was trained on the `cuad` dataset.
## Predicted Entities
`signers`, `other`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_signers_clause_en_1.0.0_3.0_1668693373474.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_signers_clause_en_1.0.0_3.0_1668693373474.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
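This card also lacks a usage block; the sketch below follows the same pattern as other Legal NLP classifier cards. The embeddings model name (`sent_bert_base_cased`) is an assumption; confirm the companion embeddings model on Models Hub before use.

```python
# Sketch: document -> Bert sentence embeddings -> CUAD signers-clause classifier.
# NOTE: `sent_bert_base_cased` is an assumed embeddings model, not confirmed by this card.
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_cuad_signers_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```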
## Results
```bash
+---------+
|   result|
+---------+
|[signers]|
|  [other]|
+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_cuad_signers_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.9 MB|
## References
CUAD dataset
## Benchmarking
```bash
label precision recall f1-score support
other 1.00 1.00 1.00 73
signers 1.00 1.00 1.00 35
accuracy - - 1.00 108
macro-avg 1.00 1.00 1.00 108
weighted-avg 1.00 1.00 1.00 108
```
---
layout: model
title: English RobertaForQuestionAnswering (from SauravMaheshkar)
author: John Snow Labs
name: roberta_qa_roberta_base_chaii
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-chaii` is an English model originally trained by `SauravMaheshkar`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_chaii_en_4.0.0_3.0_1655730347590.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_chaii_en_4.0.0_3.0_1655730347590.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_chaii","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_chaii","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.chaii.roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
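In the NLU one-liner above, the question and its context are passed as a single string joined by `|||`. Conceptually the loader splits them back apart before running the span classifier; a minimal sketch of that convention (the split shown here is an illustration of the input format, not NLU's actual internals):

```python
def split_qa(joined: str, sep: str = "|||"):
    """Split a 'question|||context' string into its two parts."""
    question, _, context = joined.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```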
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_chaii|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/SauravMaheshkar/roberta-base-chaii
---
layout: model
title: Fast Neural Machine Translation Model from Central Bikol to German
author: John Snow Labs
name: opus_mt_bcl_de
date: 2021-06-01
tags: [open_source, seq2seq, translation, bcl, de, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
source languages: bcl
target languages: de
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_de_xx_3.1.0_2.4_1622550430850.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_de_xx_3.1.0_2.4_1622550430850.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_bcl_de", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_bcl_de", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Central Bikol.translate_to.German').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_bcl_de|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Word2Vec Embeddings in Cebuano (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-14
tags: [cc, embeddings, fastText, word2vec, ceb, open_source]
task: Embeddings
language: ceb
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ceb_3.4.1_3.0_1647290267903.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ceb_3.4.1_3.0_1647290267903.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ceb") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ganahan ko spark nlp"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ceb")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ganahan ko spark nlp").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ceb.embed.w2v_cc_300d").predict("""Ganahan ko spark nlp""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|ceb|
|Size:|1.2 GB|
|Case sensitive:|false|
|Dimension:|300|
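The annotator attaches a 300-dimensional vector to each token. Downstream, token similarity is typically measured with cosine similarity over those vectors; a self-contained sketch on toy 3-d vectors (the real embeddings have 300 components):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Identical directions score 1.0; orthogonal directions score 0.0.
print(cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```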
---
layout: model
title: English image_classifier_vit__beans ViTForImageClassification from johnnydevriese
author: John Snow Labs
name: image_classifier_vit__beans
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit__beans` is an English model originally trained by johnnydevriese.
## Predicted Entities
`angular_leaf_spot`, `bean_rust`, `healthy`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit__beans_en_4.1.0_3.0_1660169646080.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit__beans_en_4.1.0_3.0_1660169646080.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit__beans", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit__beans", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit__beans|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_rule_based_hier_triplet_shuffled_epochs_1_shard_1_squad2.0
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_hier_triplet_shuffled_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_triplet_shuffled_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223641296.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_triplet_shuffled_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223641296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_triplet_shuffled_epochs_1_shard_1_squad2.0","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_triplet_shuffled_epochs_1_shard_1_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_rule_based_hier_triplet_shuffled_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|460.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AnonymousSub/rule_based_roberta_hier_triplet_shuffled_epochs_1_shard_1_squad2.0
---
layout: model
title: English T5ForConditionalGeneration Cased model (from dbernsohn)
author: John Snow Labs
name: t5_wikisql_en2sql
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5_wikisql_en2SQL` is an English model originally trained by `dbernsohn`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_wikisql_en2sql_en_4.3.0_3.0_1675157192158.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_wikisql_en2sql_en_4.3.0_3.0_1675157192158.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_wikisql_en2sql","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_wikisql_en2sql","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_wikisql_en2sql|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|288.2 MB|
## References
- https://huggingface.co/dbernsohn/t5_wikisql_en2SQL
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://github.com/DorBernsohn/CodeLM/tree/main/SQLM
- https://www.linkedin.com/in/dor-bernsohn-70b2b1146/
---
layout: model
title: Tamil XlmRoBertaForQuestionAnswering (from AswiN037)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_squad_tamil
date: 2022-06-23
tags: [ta, open_source, question_answering, xlmroberta]
task: Question Answering
language: ta
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-squad-tamil` is a Tamil model originally trained by `AswiN037`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_squad_tamil_ta_4.0.0_3.0_1655996786525.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_squad_tamil_ta_4.0.0_3.0_1655996786525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_squad_tamil","ta") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlm_roberta_squad_tamil","ta")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("ta.answer_question.squad.xlm_roberta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_squad_tamil|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|ta|
|Size:|1.9 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AswiN037/xlm-roberta-squad-tamil
---
layout: model
title: English DistilBertForTokenClassification Cased model (from m3hrdadfi)
author: John Snow Labs
name: distilbert_tok_classifier_typo_detector
date: 2023-03-06
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `typo-detector-distilbert-en` is an English model originally trained by `m3hrdadfi`.
## Predicted Entities
`TYPO`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_tok_classifier_typo_detector_en_4.3.1_3.0_1678134333311.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_tok_classifier_typo_detector_en_4.3.1_3.0_1678134333311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_tok_classifier_typo_detector","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_tok_classifier_typo_detector","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_tok_classifier_typo_detector|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|244.1 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/m3hrdadfi/typo-detector-distilbert-en
- https://github.com/neuspell/neuspell
- https://github.com/m3hrdadfi/typo-detector/issues
---
layout: model
title: English RobertaForQuestionAnswering (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_roberta_FT_newsqa
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta_FT_newsqa` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_FT_newsqa_en_4.0.0_3.0_1655738866363.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_FT_newsqa_en_4.0.0_3.0_1655738866363.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_FT_newsqa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_FT_newsqa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.news.roberta.qa_roberta_ft_newsqa.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_FT_newsqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|458.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/roberta_FT_newsqa
---
layout: model
title: Hebrew BertForQuestionAnswering model (from tdklab)
author: John Snow Labs
name: bert_qa_hebert_finetuned_hebrew_squad
date: 2022-06-02
tags: [he, open_source, question_answering, bert]
task: Question Answering
language: he
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `hebert-finetuned-hebrew-squad` is a Hebrew model originally trained by `tdklab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_hebert_finetuned_hebrew_squad_he_4.0.0_3.0_1654187940492.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_hebert_finetuned_hebrew_squad_he_4.0.0_3.0_1654187940492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_hebert_finetuned_hebrew_squad","he") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_hebert_finetuned_hebrew_squad","he")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("he.answer_question.squad.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_hebert_finetuned_hebrew_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|he|
|Size:|408.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/tdklab/hebert-finetuned-hebrew-squad
---
layout: model
title: English DistilBertForQuestionAnswering model (from FOFer)
author: John Snow Labs
name: distilbert_qa_FOFer_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `FOFer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_FOFer_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724124682.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_FOFer_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724124682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_FOFer_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_FOFer_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_FOFer").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_FOFer_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/FOFer/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Word Segmenter for Chinese
author: John Snow Labs
name: wordseg_ctb9
date: 2021-03-08
tags: [word_segmentation, open_source, chinese, wordseg_ctb9, zh]
task: Word Segmentation
language: zh
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
[WordSegmenterModel-WSM](https://en.wikipedia.org/wiki/Text_segmentation) is based on a maximum entropy probability model to detect word boundaries in Chinese text.
Chinese text is written without white space between words, so a computer-based application cannot know a priori which sequence of ideograms forms a word.
Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.
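The pretrained model learns boundaries with maximum entropy; to see why segmentation is non-trivial, here is a deliberately simple dictionary-based maximum-matching baseline (the dictionary and input text are illustrative only, not part of the model):

```python
def max_match(text, dictionary, max_len=4):
    """Greedy left-to-right longest-match segmentation.

    Scans the text once, always taking the longest dictionary entry
    starting at the current position; unknown characters become
    single-character tokens.
    """
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

vocab = {"你好", "从"}
print(max_match("从你好", vocab))  # ['从', '你好']
```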
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/chinese/word_segmentation/words_segmenter_demo.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_ctb9_zh_3.0.0_3.0_1615225768619.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_ctb9_zh_3.0.0_3.0_1615225768619.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("sentence")
word_segmenter = WordSegmenterModel.pretrained("wordseg_ctb9", "zh") \
.setInputCols(["sentence"]) \
.setOutputCol("token")
pipeline = Pipeline(stages=[document_assembler, word_segmenter])
ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
example = spark.createDataFrame([['从John Snow Labs你好! ']], ["text"])
result = ws_model.transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("sentence")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_ctb9", "zh")
.setInputCols(Array("sentence"))
.setOutputCol("token")
val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter))
val data = Seq("从John Snow Labs你好! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["从John Snow Labs你好! "]
token_df = nlu.load('zh.segment_words.ctb9').predict(text)
token_df
```
## Results
```bash
0 从
1 J
2 o
3 h
4 n
5 S
6 n
7 o
8 w
9 Labs
10 你
11 好
12 !
Name: token, dtype: object
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|wordseg_ctb9|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[words_segmented]|
|Language:|zh|
---
layout: model
title: Summarize Clinical Notes in Layman Terms
author: John Snow Labs
name: summarizer_clinical_laymen
date: 2023-05-29
tags: [licensed, en, clinical, summarization, tensorflow]
task: Summarization
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalSummarizer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a modified version of a Flan-T5 (LLM) based summarization model, fine-tuned with a custom dataset by John Snow Labs to avoid clinical jargon in the summaries. It can generate summaries of up to 512 tokens given an input text (max 1024 tokens).
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_laymen_en_4.4.2_3.0_1685360017257.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_laymen_en_4.4.2_3.0_1685360017257.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
summarizer = MedicalSummarizer.pretrained("summarizer_clinical_laymen", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("summary")\
.setMaxNewTokens(512)
pipeline = Pipeline(stages=[
document_assembler,
summarizer
])
text ="""Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43. She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image. She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year. She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weight loss, and Dexatrim for one month in 2005 with a five-pound weight loss. She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss.\n\nPAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath.\n\nPAST SURGICAL HISTORY: Pertinent for cholecystectomy.\n\nPSYCHOLOGICAL HISTORY: Negative.\n\nSOCIAL HISTORY: She is single. She drinks alcohol once a week. She does not smoke.\n\nFAMILY HISTORY: Pertinent for obesity and hypertension.\n\nMEDICATIONS: Include Topamax 100 mg twice daily, Zoloft 100 mg twice daily, Abilify 5 mg daily, Motrin 800 mg daily, and a multivitamin.\n\nALLERGIES: She has no known drug allergies.\n\nREVIEW OF SYSTEMS: Negative.\n\nPHYSICAL EXAM: This is a pleasant female in no acute distress. Alert and oriented x 3. HEENT: Normocephalic, atraumatic. Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. 
Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis.\n\nASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy. Once this is completed, we will submit her to her insurance company for approval.
"""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
## Results
```bash
['This is a clinical note about a 34-year-old woman who is interested in having weight loss surgery. She has been overweight for over 20 years and wants to have more energy and improve her self-image. She has tried many diets and weight loss programs, but has not been successful in keeping the weight off. She has a history of hypertension and shortness of breath, but is not allergic to any medications. She will have an upper endoscopy and will be contacted by a nutritionist and social worker. The plan is to have her weight loss surgery through the gastric bypass, rather than Lap-Band.']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|summarizer_clinical_laymen|
|Compatibility:|Healthcare NLP 4.4.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|920.5 MB|
---
layout: model
title: Danish asr_xls_r_300m_nst_cv9 TFWav2Vec2ForCTC from chcaa
author: John Snow Labs
name: asr_xls_r_300m_nst_cv9
date: 2022-09-25
tags: [wav2vec2, da, audio, open_source, asr]
task: Automatic Speech Recognition
language: da
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_300m_nst_cv9` is a Danish model originally trained by chcaa.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_xls_r_300m_nst_cv9_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xls_r_300m_nst_cv9_da_4.2.0_3.0_1664103508619.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xls_r_300m_nst_cv9_da_4.2.0_3.0_1664103508619.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_xls_r_300m_nst_cv9", "da")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_xls_r_300m_nst_cv9", "da")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_xls_r_300m_nst_cv9|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|da|
|Size:|756.3 MB|
---
layout: model
title: Sentence Entity Resolver for LOINC (sbiobert_base_cased_mli embeddings)
author: John Snow Labs
name: sbiobertresolve_loinc_cased
date: 2021-12-24
tags: [en, clinical, licensed, entity_resolution, loinc]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.4
spark_version: 2.4
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted clinical NER entities to LOINC codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It is trained with augmented cased (unlowered) concept names, since the sbiobert model is cased.
## Predicted Entities
`LOINC`
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_loinc_cased_en_3.3.4_2.4_1640374998947.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_loinc_cased_en_3.3.4_2.4_1640374998947.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical','en', 'clinical/models')\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
rad_ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
rad_ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(['Test'])
chunk2doc = Chunk2Doc() \
.setInputCols("ner_chunk") \
.setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")\
.setCaseSensitive(True)
resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_loinc_cased", "en", "clinical/models") \
.setInputCols(["sbert_embeddings"])\
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
pipeline = Pipeline(stages = [
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
rad_ner,
rad_ner_converter,
chunk2doc,
sbert_embedder,
resolver
])
data = spark.createDataFrame([["""The patient is a 22-year-old female with a history of obesity. She has a BMI of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. Her hemoglobin is 8.2%."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val rad_ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val rad_ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("Test"))
val chunk2doc = new Chunk2Doc()
.setInputCols("ner_chunk")
.setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
.setCaseSensitive(true)
val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_loinc_cased", "en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, rad_ner, rad_ner_converter, chunk2doc, sbert_embedder, resolver))
val data = Seq("The patient is a 22-year-old female with a history of obesity. She has a BMI of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. Her hemoglobin is 8.2%.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.loinc_cased").predict("""The patient is a 22-year-old female with a history of obesity. She has a BMI of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. Her hemoglobin is 8.2%.""")
```
## Results
```bash
+-------------------------------------+------+-----------+----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ner_chunk|entity| resolution| all_codes| resolutions|
+-------------------------------------+------+-----------+----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| BMI| Test| LP35925-4|[LP35925-4, 59574-4, BDYCRC, 73964-9, 59574-4,... |[Body mass index (BMI), Body mass index, Body circumference, Body muscle mass, Body mass index (BMI) [Percentile], ... |
| aspartate aminotransferase| Test| 14409-7|[14409-7, 1916-6, 16325-3, 16324-6, 43822-6, 308... |[Aspartate aminotransferase, Aspartate aminotransferase/Alanine aminotransferase, Alanine aminotransferase/Aspartate aminotransferase, Alanine aminotransferase, Aspartate aminotransferase [Prese... |
| alanine aminotransferase| Test| 16324-6|[16324-6, 16325-3, 14409-7, 1916-6, 59245-1, 30... |[Alanine aminotransferase, Alanine aminotransferase/Aspartate aminotransferase, Aspartate aminotransferase, Aspartate aminotransferase/Alanine aminotransferase, Alanine glyoxylate aminotransfer,... |
| hemoglobin| Test| 14775-1|[14775-1, 16931-8, 12710-0, 29220-1, 15082-1, 72... |[Hemoglobin, Hematocrit/Hemoglobin, Hemoglobin pattern, Haptoglobin, Methemoglobin, Oxyhemoglobin, Hemoglobin test status, Verdohemoglobin, Hemoglobin A, Hemoglobin distribution width, Myoglobin,... |
+-------------------------------------+------+-----------+----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_loinc_cased|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[loinc_code]|
|Language:|en|
|Size:|648.5 MB|
|Case sensitive:|true|
---
layout: model
title: Financial Statements Item Binary Classifier
author: John Snow Labs
name: finclf_financial_statements_item
date: 2022-08-10
tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `financial_statements` item type of 10-K Annual Reports. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it is better to skip it unless you want to do Binary Classification at sentence level.
If you have big financial documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that this model's embeddings allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the same tutorial linked above).
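As a minimal illustration of the first technique (paragraph splitting by multiline), here is a plain-regex sketch. This is not the workshop utility itself, just the idea of cutting a long filing into classifiable pieces:

```python
import re

def split_paragraphs(text):
    """Split a filing into paragraphs on blank lines, dropping empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical 10-K snippet; each resulting paragraph can be classified separately.
filing = "ITEM 7. MANAGEMENT'S DISCUSSION\n\nRevenue grew 12%.\n\nITEM 8. FINANCIAL STATEMENTS"
paragraphs = split_paragraphs(filing)
```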
## Predicted Entities
`other`, `financial_statements`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_financial_statements_item_en_1.0.0_3.2_1660154427604.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_financial_statements_item_en_1.0.0_3.2_1660154427604.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
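The usage snippet is missing from this card. Below is a hedged sketch that mirrors the other classifier cards in this catalog; the `UniversalSentenceEncoder` stage is an assumption (the card only states that the classifier consumes `sentence_embeddings` and emits `category`), so substitute the embeddings the model was actually trained with.

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumption: any sentence-embedding annotator producing "sentence_embeddings" fits here.
embeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

classifier = ClassifierDLModel.pretrained("finclf_financial_statements_item", "en", "finance/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, classifier])

data = spark.createDataFrame([["ITEM 8. FINANCIAL STATEMENTS AND SUPPLEMENTARY DATA ..."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```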
## Results
```bash
+----------------------+
|result                |
+----------------------+
|[financial_statements]|
|[other]               |
|[other]               |
|[financial_statements]|
+----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finclf_financial_statements_item|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.6 MB|
## References
Weak labelling on documents from Edgar database
## Benchmarking
```bash
label precision recall f1-score support
financial_statements 0.86 0.96 0.91 1204
other 0.96 0.85 0.90 1254
accuracy - - 0.90 2458
macro-avg 0.91 0.91 0.90 2458
weighted-avg 0.91 0.90 0.90 2458
```
---
layout: model
title: Adverse Drug Events Binary Classifier (BioBERT)
author: John Snow Labs
name: bert_sequence_classifier_ade_augmented
date: 2022-07-27
tags: [clinical, licensed, public_health, ade, classifier, sequence_classification, en]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier that can classify tweets reporting ADEs (Adverse Drug Events).
## Predicted Entities
`ADE`, `noADE`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_ade_augmented_en_4.0.0_3.0_1658905698079.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_ade_augmented_en_4.0.0_3.0_1658905698079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_ade_augmented", "en", "clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
from pyspark.sql.types import StringType

data = spark.createDataFrame(["So glad I am off effexor, so sad it ruined my teeth. tip Please be carefull taking antideppresiva and read about it 1st",
"Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"], StringType()).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("text", "class.result").show(truncate=False)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_ade_augmented", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier))
val data = Seq("So glad I am off effexor, so sad it ruined my teeth. tip Please be carefull taking antideppresiva and read about it 1st",
"Religare Capital Ranbaxy has been accepting approval for Diovan since 2012").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.adverse_drug_events").predict("""So glad I am off effexor, so sad it ruined my teeth. tip Please be carefull taking antideppresiva and read about it 1st""")
```
## Results
```bash
+-----------------------------------------------------------------------------------------------------------------------+-------+
|text |result |
+-----------------------------------------------------------------------------------------------------------------------+-------+
|So glad I am off effexor, so sad it ruined my teeth. tip Please be carefull taking antideppresiva and read about it 1st|[ADE] |
|Religare Capital Ranbaxy has been accepting approval for Diovan since 2012 |[noADE]|
+-----------------------------------------------------------------------------------------------------------------------+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_ade_augmented|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## Benchmarking
```bash
label precision recall f1-score support
ADE 0.9696 0.9595 0.9645 2763
noADE 0.9670 0.9753 0.9712 3366
accuracy - - 0.9682 6129
macro-avg 0.9683 0.9674 0.9678 6129
weighted-avg 0.9682 0.9682 0.9682 6129
```
---
layout: model
title: Italian Legal RoBERTa Embeddings
author: John Snow Labs
name: roberta_large_italian_legal
date: 2023-02-16
tags: [it, italian, embeddings, transformer, open_source, legal, tensorflow]
task: Embeddings
language: it
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Legal RoBERTa Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-italian-roberta-large` is an Italian model originally trained by `joelito`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_large_italian_legal_it_4.2.4_3.0_1676557559157.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_large_italian_legal_it_4.2.4_3.0_1676557559157.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
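The usage snippet is missing from this card. Here is a minimal sketch following the pattern of the other embeddings cards in this catalog; the input column names and the Italian sample sentence are assumptions:

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = RoBertaEmbeddings.pretrained("roberta_large_italian_legal", "it") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings])

data = spark.createDataFrame([["Il contratto è regolato dalla legge italiana."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```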
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_large_italian_legal|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|it|
|Size:|1.3 GB|
|Case sensitive:|true|
## References
https://huggingface.co/joelito/legal-italian-roberta-large
---
layout: model
title: Explain Document Pipeline for Spanish
author: John Snow Labs
name: explain_document_md
date: 2021-03-22
tags: [open_source, spanish, explain_document_md, pipeline, es]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: es
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The explain_document_md is a pretrained pipeline that processes text with a simple sequence of basic steps.
It performs most of the common text processing tasks on your data frame.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_es_3.0.0_3.0_1616431976931.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_es_3.0.0_3.0_1616431976931.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('explain_document_md', lang = 'es')
annotations = pipeline.fullAnnotate("Hola de John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_md", lang = "es")
val result = pipeline.fullAnnotate("Hola de John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hola de John Snow Labs! "]
result_df = nlu.load('es.explain.md').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | lemma | pos | embeddings | ner | entities |
|---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
| 0 | ['Hola de John Snow Labs! '] | ['Hola de John Snow Labs!'] | ['Hola', 'de', 'John', 'Snow', 'Labs!'] | ['Hola', 'de', 'John', 'Snow', 'Labs!'] | ['PART', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.5123000144958496,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_document_md|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|es|
---
layout: model
title: Legal Separation Agreement Document Classifier (Longformer)
author: John Snow Labs
name: legclf_separation_agreement
date: 2022-11-24
tags: [en, legal, classification, agreement, separation, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_separation_agreement` model is a Legal Longformer Document Classifier that predicts whether a document belongs to the class `separation-agreement` or not (Binary Classification).
Longformers have a limit of 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that for the large majority of documents in legal corpora, provided they are clean and contain only the legal document without any extra leading material, 4096 tokens are enough to perform Document Classification.
If not, let us know and we can take another approach: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, which means the whole document is taken into account. This should not normally be required.
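The chunk-and-average fallback described above can be sketched in plain Python. This is illustrative only; in practice the vectors come from the Longformer embeddings stage:

```python
def chunks(tokens, size=4096):
    """Split a token list into consecutive pieces of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def average_vectors(vectors):
    """Element-wise mean of per-chunk embedding vectors -> one document vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# e.g. a 10,000-token document becomes three chunks; embed each chunk, then
# average the three chunk embeddings into a single vector for the classifier.
```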
## Predicted Entities
`separation-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_separation_agreement_en_1.0.0_3.0_1669294576564.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_separation_agreement_en_1.0.0_3.0_1669294576564.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
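The usage snippet is missing from this card. Below is a hedged sketch mirroring the other Legal NLP classifier cards; the `legal_longformer_base` embeddings model is an assumption (the card only says the classifier is Longformer-based), so swap in the embeddings used at training time.

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Assumption: a Legal Longformer embeddings model (4096-token window).
embeddings = LongformerEmbeddings.pretrained("legal_longformer_base", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

doc_classifier = ClassifierDLModel.pretrained("legclf_separation_agreement", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier])

data = spark.createDataFrame([["SEPARATION AGREEMENT AND GENERAL RELEASE ..."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```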
## Results
```bash
+----------------------+
|result                |
+----------------------+
|[separation-agreement]|
|[other]               |
|[other]               |
|[separation-agreement]|
+----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_separation_agreement|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.7 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house + SEC documents
## Benchmarking
```bash
label precision recall f1-score support
other 0.94 0.91 0.93 90
separation-agreement 0.82 0.88 0.85 42
accuracy - - 0.90 132
macro-avg 0.88 0.90 0.89 132
weighted-avg 0.90 0.90 0.90 132
```
---
layout: model
title: Pipeline to Mapping RxNorm Codes with Corresponding National Drug Codes (NDC)
author: John Snow Labs
name: rxnorm_ndc_mapping
date: 2022-06-27
tags: [rxnorm, ndc, pipeline, chunk_mapper, clinical, licensed, en]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline maps RxNorm codes to NDC codes without using any text data. You just feed whitespace-delimited RxNorm codes, and it returns the two corresponding types of NDC codes, called `package ndc` and `product ndc`.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_ndc_mapping_en_3.5.3_3.0_1656369648141.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_ndc_mapping_en_3.5.3_3.0_1656369648141.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("rxnorm_ndc_mapping", "en", "clinical/models")
result= pipeline.fullAnnotate("1652674 259934")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("rxnorm_ndc_mapping", "en", "clinical/models")
val result= pipeline.fullAnnotate("1652674 259934")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.rxnorm_to_ndc.pipe").predict("""1652674 259934""")
```
## Results
```bash
{'document': ['1652674 259934'],
'package_ndc': ['62135-0625-60', '13349-0010-39'],
'product_ndc': ['46708-0499', '13349-0010'],
'rxnorm_code': ['1652674', '259934']}
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|rxnorm_ndc_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|4.0 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- ChunkMapperModel
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_by_ntp0102 TFWav2Vec2ForCTC from ntp0102
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_ntp0102
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_ntp0102` is an English model originally trained by ntp0102.
NOTE: This pipeline only works on a CPU. If you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_ntp0102_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_ntp0102_en_4.2.0_3.0_1664026602923.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_ntp0102_en_4.2.0_3.0_1664026602923.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_ntp0102', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_ntp0102", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_ntp0102|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|349.4 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Spanish RobertaForQuestionAnswering (from jamarju)
author: John Snow Labs
name: roberta_qa_roberta_large_bne_squad_2.0_es_jamarju
date: 2022-06-21
tags: [es, open_source, question_answering, roberta]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-bne-squad-2.0-es` is a Spanish model originally trained by `jamarju`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_bne_squad_2.0_es_jamarju_es_4.0.0_3.0_1655789415779.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_bne_squad_2.0_es_jamarju_es_4.0.0_3.0_1655789415779.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_bne_squad_2.0_es_jamarju","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_large_bne_squad_2.0_es_jamarju","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.squad.roberta.large.by_jamarju").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_large_bne_squad_2.0_es_jamarju|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/jamarju/roberta-large-bne-squad-2.0-es
- https://github.com/PlanTL-SANIDAD/lm-spanish
- https://github.com/ccasimiro88/TranslateAlignRetrieve
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from miamiya)
author: John Snow Labs
name: roberta_qa_miamiya_base_squad2_finetuned_squad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-squad` is an English model originally trained by `miamiya`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_miamiya_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219308729.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_miamiya_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219308729.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_miamiya_base_squad2_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_miamiya_base_squad2_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_miamiya_base_squad2_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|464.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/miamiya/roberta-base-squad2-finetuned-squad
---
layout: model
title: Hungarian Legal Roberta Embeddings
author: John Snow Labs
name: roberta_base_hungarian_legal
date: 2023-02-16
tags: [hu, hungarian, embeddings, transformer, open_source, legal, tensorflow]
task: Embeddings
language: hu
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-hungarian-roberta-base` is a Hungarian model originally trained by `joelito`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_hungarian_legal_hu_4.2.4_3.0_1676558480899.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_hungarian_legal_hu_4.2.4_3.0_1676558480899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_base_hungarian_legal|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|hu|
|Size:|416.0 MB|
|Case sensitive:|true|
## References
https://huggingface.co/joelito/legal-hungarian-roberta-base
---
layout: model
title: Word2Vec Embeddings in Sundanese (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, su, open_source]
task: Embeddings
language: su
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_su_3.4.1_3.0_1647459488324.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_su_3.4.1_3.0_1647459488324.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","su") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Abdi bogoh Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","su")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Abdi bogoh Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("su.embed.w2v_cc_300d").predict("""Abdi bogoh Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|su|
|Size:|185.9 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from Sachinkelenjaguri)
author: John Snow Labs
name: distilbert_qa_sa_qna
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `sa_Qna` is an English model originally trained by `Sachinkelenjaguri`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sa_qna_en_4.3.0_3.0_1672775418180.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sa_qna_en_4.3.0_3.0_1672775418180.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sa_qna","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sa_qna","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_sa_qna|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Sachinkelenjaguri/sa_Qna
---
layout: model
title: Translate English to Austronesian languages Pipeline
author: John Snow Labs
name: translate_en_map
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, map, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `map`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_map_xx_2.7.0_2.4_1609688461104.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_map_xx_2.7.0_2.4_1609688461104.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_map", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_map", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.map').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_map|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForTokenClassification Cased model (from Lucifermorningstar011)
author: John Snow Labs
name: distilbert_token_classifier_autotrain_final_784824211
date: 2023-03-06
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-final-784824211` is an English model originally trained by `Lucifermorningstar011`.
## Predicted Entities
`9`, `0`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824211_en_4.3.1_3.0_1678134173949.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824211_en_4.3.1_3.0_1678134173949.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824211","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824211","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_autotrain_final_784824211|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|244.1 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/Lucifermorningstar011/autotrain-final-784824211
---
layout: model
title: Sentence Entity Resolver for billable ICD10-CM HCC codes
author: John Snow Labs
name: sbiobertresolve_icd10cm_augmented_billable_hcc
date: 2021-05-16
tags: [entity_resolution, clinical, licensed, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to ICD10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It loads about 6X faster than previous versions, and the load process is more memory-friendly: peak memory during load is smaller, reducing the chance of OOM exceptions and relaxing hardware requirements. The resolver has also been augmented with synonyms, making it four times richer than the previous version, and adds support for 7-digit codes with HCC status.
## Predicted Entities
Outputs 7-digit billable ICD codes. In the result, look for the `aux_label` parameter in the metadata to get the HCC status. The HCC status can be split into three parts: `billable status`, `hcc status`, and `hcc score`. For example, in the output shared below, the billable status is 1, the HCC status is 1, and the HCC score is 8.
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_billable_hcc_en_3.0.4_2.4_1621189647111.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_billable_hcc_en_3.0.4_2.4_1621189647111.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The `sbiobertresolve_icd10cm_augmented_billable_hcc` resolver model must be used with `sbiobert_base_cased_mli` as the embeddings and `ner_clinical` as the NER model, with `PROBLEM` set in `.setWhiteList()`.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sbert_embeddings")
icd10_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models") \
.setInputCols(["document", "sbert_embeddings"]) \
.setOutputCol("icd10cm_code")\
.setDistanceFunction("EUCLIDEAN").setReturnCosineDistances(True)
bert_pipeline_icd = Pipeline(stages = [document_assembler, sbert_embedder, icd10_resolver])
data = spark.createDataFrame([["metastatic lung cancer"]]).toDF("text")
results = bert_pipeline_icd.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sbert_embeddings")
val icd10_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models")
.setInputCols(Array("document", "sbert_embeddings"))
.setOutputCol("icd10cm_code")
.setDistanceFunction("EUCLIDEAN")
.setReturnCosineDistances(true)
val bert_pipeline_icd = new Pipeline().setStages(Array(document_assembler, sbert_embedder, icd10_resolver))
val data = Seq("metastatic lung cancer").toDF("text")
val result = bert_pipeline_icd.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.icd10cm.augmented_billable").predict("""metastatic lung cancer""")
```
## Results
```bash
| | chunks | code | resolutions | all_codes | billable_hcc_status_score | all_distances |
|---:|:-----------------------|:-------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------|:----------------------------|:-------------------------------------------------------------------------------------------------------------------------|
| 0 | metastatic lung cancer | C7800 | ['cancer metastatic to lung', 'metastasis from malignant tumor of lung', 'cancer metastatic to left lung', 'history of cancer metastatic to lung', 'metastatic cancer', 'history of cancer metastatic to lung (situation)', 'metastatic adenocarcinoma to bilateral lungs', 'cancer metastatic to chest wall', 'metastatic malignant neoplasm to left lower lobe of lung', 'metastatic carcinoid tumour', 'cancer metastatic to respiratory tract', 'metastatic carcinoid tumor'] | ['C7800', 'C349', 'C7801', 'Z858', 'C800', 'Z8511', 'C780', 'C798', 'C7802', 'C799', 'C7830', 'C7B00'] | ['1', '1', '8'] | ['0.0464', '0.0829', '0.0852', '0.0860', '0.0914', '0.0989', '0.1133', '0.1220', '0.1220', '0.1253', '0.1249', '0.1260'] |
```
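The `billable_hcc_status_score` triple shown in the table above can be unpacked into named fields. A minimal sketch, using the example value from the results table (billable status, HCC status, HCC score, in that order, as described in Predicted Entities):

```python
# Example aux_label triple from the results table above.
aux = ["1", "1", "8"]

# Name the three positions: billable status, HCC status, HCC score.
hcc_info = dict(zip(["billable", "hcc_status", "hcc_score"], aux))
print(hcc_info)  # -> {'billable': '1', 'hcc_status': '1', 'hcc_score': '8'}
```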
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_icd10cm_augmented_billable_hcc|
|Compatibility:|Healthcare NLP 3.0.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[icd10cm_code]|
|Language:|en|
|Case sensitive:|false|
---
layout: model
title: Oncology Pipeline for Therapies
author: John Snow Labs
name: oncology_therapy_pipeline
date: 2023-03-29
tags: [licensed, pipeline, oncology, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline includes Named-Entity Recognition and Assertion Status models to extract information from oncology texts. This pipeline focuses on entities related to therapies.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/oncology_therapy_pipeline_en_4.3.2_3.2_1680123025997.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/oncology_therapy_pipeline_en_4.3.2_3.2_1680123025997.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("oncology_therapy_pipeline", "en", "clinical/models")
text = '''The patient underwent a mastectomy two years ago. She is currently receiving her second cycle of adriamycin and cyclophosphamide, and is in good overall condition.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("oncology_therapy_pipeline", "en", "clinical/models")
val text = "The patient underwent a mastectomy two years ago. She is currently receiving her second cycle of adriamycin and cyclophosphamide, and is in good overall condition."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.oncology_therpay.pipeline").predict("""The patient underwent a mastectomy two years ago. She is currently receiving her second cycle of adriamycin and cyclophosphamide, and is in good overall condition.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_ara_base_artydiqa","ar") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_ara_base_artydiqa","ar")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.answer_question.tydiqa.electra.base").predict("""ما هو اسمي؟|||اسمي كلارا وأنا أعيش في بيركلي.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_ara_base_artydiqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ar|
|Size:|504.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/wissamantoun/araelectra-base-artydiqa
---
layout: model
title: Explain Document Pipeline for Italian
author: John Snow Labs
name: explain_document_md
date: 2021-03-22
tags: [open_source, italian, explain_document_md, pipeline, it]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: it
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The explain_document_md is a pretrained pipeline that processes text through a simple sequence of basic steps, performing most of the common text processing tasks on your dataframe.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_it_3.0.0_3.0_1616430477970.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_it_3.0.0_3.0_1616430477970.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('explain_document_md', lang = 'it')
annotations = pipeline.fullAnnotate("Ciao da John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_md", lang = "it")
val result = pipeline.fullAnnotate("Ciao da John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Ciao da John Snow Labs! "]
result_df = nlu.load('it.explain.document').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | lemma | pos | embeddings | ner | entities |
|---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
| 0 | ['Ciao da John Snow Labs! '] | ['Ciao da John Snow Labs!'] | ['Ciao', 'da', 'John', 'Snow', 'Labs!'] | ['Ciao', 'da', 'John', 'Snow', 'Labs!'] | ['VERB', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[-0.146050006151199,.,...]] | ['O', 'O', 'I-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```
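The `entities` column above is derived from the token-level `ner` tags by grouping consecutive `B-`/`I-` tagged tokens into chunks. A minimal pure-Python sketch of that grouping step (independent of Spark NLP, using the tokens and tags from the table above):

```python
def chunk_entities(tokens, tags):
    """Group consecutive B-/I- tagged tokens into entity chunks."""
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            if current:
                chunks.append(" ".join(current))
                current = []
        else:
            # A B- prefix starts a new chunk; I- continues the current one.
            if tag.startswith("B-") and current:
                chunks.append(" ".join(current))
                current = []
            current.append(token)
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ['Ciao', 'da', 'John', 'Snow', 'Labs!']
tags = ['O', 'O', 'I-PER', 'I-PER', 'I-PER']
print(chunk_entities(tokens, tags))  # ['John Snow Labs!']
```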
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_document_md|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|it|
---
layout: model
title: Chinese BertForMaskedLM Large Cased model (from genggui001)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_wwm_large_ext_fix_mlm` is a Chinese model originally trained by `genggui001`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm_zh_4.2.4_3.0_1670326139931.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm_zh_4.2.4_3.0_1670326139931.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["我爱 Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("我爱 Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
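The pipeline above attaches a dense vector to every token in the `embeddings` column; the usual way to compare tokens is cosine similarity between their vectors. A stdlib-only sketch of that comparison (the toy 4-dimensional vectors below are illustrative only, not output from this model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors standing in for the model's token embeddings.
v_token_a = [0.2, 0.7, 0.1, 0.0]
v_token_b = [0.9, 0.1, 0.0, 0.3]
print(round(cosine(v_token_a, v_token_a), 3))  # 1.0
print(cosine(v_token_a, v_token_b) < 1.0)      # True
```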
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|1.2 GB|
|Case sensitive:|true|
## References
- https://huggingface.co/genggui001/chinese_roberta_wwm_large_ext_fix_mlm
- https://github.com/ymcui/Chinese-BERT-wwm/issues/98
- https://github.com/genggui001/chinese_roberta_wwm_large_ext_fix_mlm
---
layout: model
title: English DistilBertForQuestionAnswering model (from machine2049) Duorc
author: John Snow Labs
name: distilbert_qa_base_uncased_finetuned_duorc_
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-duorc_distilbert` is an English model originally trained by `machine2049`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_duorc__en_4.0.0_3.0_1654723876220.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_duorc__en_4.0.0_3.0_1654723876220.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_duorc_","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_duorc_","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.base_uncased.by_machine2049").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
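Conceptually, an extractive QA model of this kind scores every context token as a potential answer start and answer end, and the returned answer is the highest-scoring valid span. A toy sketch of that span-selection step (the scores below are made up for illustration, not this model's actual logits):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) pair maximizing start+end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = ss + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.1, 0.0, 0.0, 0.1, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 4.8, 0.2, 0.0, 0.0, 0.0, 0.5, 0.1]
s, e = best_span(start, end)
print(" ".join(context[s:e + 1]))  # Clara
```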
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_finetuned_duorc_|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/machine2049/distilbert-base-uncased-finetuned-duorc_distilbert
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from kaouther)
author: John Snow Labs
name: distilbert_qa_kaouther_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `kaouther`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_kaouther_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771677866.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_kaouther_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771677866.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kaouther_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kaouther_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_kaouther_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/kaouther/distilbert-base-uncased-finetuned-squad
---
layout: model
title: French CamemBert Embeddings (from gulabpatel)
author: John Snow Labs
name: camembert_embeddings_new_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `new-dummy-model` is a French model originally trained by `gulabpatel`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_new_generic_model_fr_3.4.4_3.0_1653991782298.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_new_generic_model_fr_3.4.4_3.0_1653991782298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_new_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_new_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_new_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/gulabpatel/new-dummy-model
---
layout: model
title: German Financial Bert Word Embeddings
author: John Snow Labs
name: bert_sentence_embeddings_financial
date: 2022-05-04
tags: [bert, embeddings, de, open_source, financial]
task: Embeddings
language: de
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Although the model name contains the word `sentence`, this is a Word Embeddings model.
Financial pretrained BERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `german-financial-statements-bert` is a German model originally trained by `fabianrausch`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sentence_embeddings_financial_de_3.4.2_3.0_1651678415089.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sentence_embeddings_financial_de_3.4.2_3.0_1651678415089.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_sentence_embeddings_financial","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ich liebe Spark-NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_sentence_embeddings_financial","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ich liebe Spark-NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sentence_embeddings_financial|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|de|
|Size:|409.6 MB|
|Case sensitive:|true|
---
layout: model
title: English asr_wav2vec2_base_timit_ali_hasan_colab_EX2 TFWav2Vec2ForCTC from ali221000262
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_ali_hasan_colab_EX2
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_ali_hasan_colab_EX2` is an English model originally trained by ali221000262.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_ali_hasan_colab_EX2_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_ali_hasan_colab_EX2_en_4.2.0_3.0_1664038560611.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_ali_hasan_colab_EX2_en_4.2.0_3.0_1664038560611.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_ali_hasan_colab_EX2', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_ali_hasan_colab_EX2", lang = "en")
val annotations = pipeline.transform(audioDF)
```
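Wav2Vec2 is a CTC model: it emits one character prediction per audio frame, and decoding collapses repeated characters and removes the blank symbol. A minimal greedy CTC-decoding sketch (the per-frame labels below are made up for illustration):

```python
BLANK = "_"

def ctc_greedy_decode(frame_labels):
    """Collapse repeated frame labels, then drop CTC blank tokens."""
    out, prev = [], None
    for ch in frame_labels:
        if ch != prev and ch != BLANK:
            out.append(ch)
        prev = ch
    return "".join(out)

# Per-frame argmax labels for a short utterance; the blank between the
# two l's keeps the doubled letter from being collapsed.
frames = ["h", "h", "_", "e", "l", "l", "_", "l", "o", "o", "_"]
print(ctc_greedy_decode(frames))  # hello
```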
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_ali_hasan_colab_EX2|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|354.9 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Named Entity Recognition (NER) Model in Norwegian (Norne 6B 100)
author: John Snow Labs
name: norne_6B_100
date: 2020-05-06
task: Named Entity Recognition
language: "no"
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [ner, nn, nb, open_source]
supported: true
annotator: NerDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Norne is a Named Entity Recognition (or NER) model for Norwegian, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. Norne 6B 100 is trained with GloVe 6B 100 word embeddings, so be sure to use the same embeddings in the pipeline.
{:.h2_title}
## Predicted Entities
Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Derived-`DRV`, Product-`PROD`, Geo-political Entities Location-`GPE_LOC`, Geo-political Entities Organization-`GPE_ORG`, Event-`EVT`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_NO/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_NO.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/norne_6B_300_no_2.5.0_2.4_1588781290264.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/norne_6B_300_no_2.5.0_2.4_1588781290264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = WordEmbeddingsModel.pretrained('glove_6B_100') \
.setInputCols(['document', 'token']) \
.setOutputCol('embeddings')
ner_model = NerDLModel.pretrained("norne_6B_100", "no") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text'))
result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. [ 9] Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella.']], ["text"]))
```
```scala
...
val embeddings = WordEmbeddingsModel.pretrained("glove_6B_100")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("norne_6B_100", "no")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. [ 9] Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella."""]
ner_df = nlu.load('no.ner.norne.glove.6B_100').predict(text, output_level = "chunk")
ner_df[["entities", "entities_confidence"]]
```
{:.h2_title}
## Results
```bash
+-------------------------------+---------+
|chunk |ner_label|
+-------------------------------+---------+
|William Henry Gates III |PER |
|Microsoft Corporation |ORG |
|Microsoft |ORG |
|Gates |PER |
|Seattle |GPE_LOC |
|Washington |GPE_LOC |
|Microsoft |ORG |
|Paul Allen |PER |
|Albuquerque |GPE_LOC |
|New Mexico |GPE_LOC |
|Gates |PER |
|Gates |PER |
|Gates |PER |
|Microsoft |ORG |
|Bill & Melinda Gates Foundation|ORG |
|Melinda Gates |PER |
|Ray Ozzie |PER |
|Craig Mundie |PER |
|Han |PER |
|Microsoft |ORG |
+-------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|norne_6B_100|
|Type:|ner|
|Compatibility:| Spark NLP 2.5.0+|
|Edition:|Official|
|License:|Open Source|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|no|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The detailed information can be found from [https://www.aclweb.org/anthology/2020.lrec-1.559.pdf](https://www.aclweb.org/anthology/2020.lrec-1.559.pdf)
---
layout: model
title: Lemmatizer (Lithuanian, SpacyLookup)
author: John Snow Labs
name: lemma_spacylookup
date: 2022-03-03
tags: [open_source, lemmatizer, lt]
task: Lemmatization
language: lt
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Lithuanian Lemmatizer is a scalable, production-ready version of the rule-based lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_lt_3.4.1_3.0_1646316598333.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_lt_3.4.1_3.0_1646316598333.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","lt") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer])
example = spark.createDataFrame([["Jūs nesate geresnis už mane"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","lt")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer))
val data = Seq("Jūs nesate geresnis už mane").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("lt.lemma").predict("""Jūs nesate geresnis už mane""")
```
## Results
```bash
+------------------------------+
|result |
+------------------------------+
|[Jūs, nebūti, geras, už, mane]|
+------------------------------+
```
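A lookup-based lemmatizer like this one is essentially a dictionary from inflected forms to lemmas, with out-of-vocabulary tokens passed through unchanged. A pure-Python sketch with a hypothetical two-entry table that reproduces the result above (the real model ships the full spaCy lookup table):

```python
# Hypothetical lookup table; the real model loads the full spaCy lookup data.
LEMMA_TABLE = {"nesate": "nebūti", "geresnis": "geras"}

def lemmatize(tokens, table):
    """Map each token to its lemma, falling back to the token itself."""
    return [table.get(tok, tok) for tok in tokens]

tokens = ["Jūs", "nesate", "geresnis", "už", "mane"]
print(lemmatize(tokens, LEMMA_TABLE))
# ['Jūs', 'nebūti', 'geras', 'už', 'mane']
```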
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma_spacylookup|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|lt|
|Size:|2.6 MB|
---
layout: model
title: Spanish BertForTokenClassification Cased model (from luch0247)
author: John Snow Labs
name: bert_token_classifier_autotrain_lucy_alicorp_1356152290
date: 2022-11-30
tags: [es, open_source, bert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: es
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-Lucy-Alicorp-1356152290` is a Spanish model originally trained by `luch0247`.
## Predicted Entities
`C`, `NM`, `VRB`, `CR`, `QT`, `DB`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_lucy_alicorp_1356152290_es_4.2.4_3.0_1669814335691.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_lucy_alicorp_1356152290_es_4.2.4_3.0_1669814335691.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_lucy_alicorp_1356152290","es") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_lucy_alicorp_1356152290","es")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_autotrain_lucy_alicorp_1356152290|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|es|
|Size:|410.2 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/luch0247/autotrain-Lucy-Alicorp-1356152290
---
layout: model
title: Fast Neural Machine Translation Model from English to Twi
author: John Snow Labs
name: opus_mt_en_tw
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, tw, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `tw`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_tw_xx_2.7.0_2.4_1609169865968.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_tw_xx_2.7.0_2.4_1609169865968.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_tw", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_tw", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.tw').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_tw|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English XLMRobertaForTokenClassification Base Cased model (from AI4Sec)
author: John Snow Labs
name: xlmroberta_ner_cyner_base
date: 2022-08-13
tags: [en, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cyner-xlm-roberta-base` is an English model originally trained by `AI4Sec`.
## Predicted Entities
`Vulnerability`, `Malware`, `System`, `Organization`, `Indicator`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_cyner_base_en_4.1.0_3.0_1660422140565.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_cyner_base_en_4.1.0_3.0_1660422140565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_cyner_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_cyner_base","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_cyner_base|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|780.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AI4Sec/cyner-xlm-roberta-base
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves TFWav2Vec2ForCTC from tonyalves
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves` is an English model originally trained by tonyalves.
NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves_en_4.2.0_3.0_1664109299079.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves_en_4.2.0_3.0_1664109299079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves", lang = "en")
val annotations = pipeline.transform(audioDF)
```
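Both snippets above assume an `audioDF` whose `audio_content` column already holds the audio as an array of floats. As a hedged, plain-Python sketch (independent of Spark, and with the helper name `wav_to_floats` purely illustrative), mono 16-bit PCM WAV bytes can be decoded into such floats like this:

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes):
    # Decode mono 16-bit PCM WAV bytes into normalized floats in [-1, 1),
    # the representation typically fed to an "audio_content" column.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Round-trip demo: write four samples, read them back as floats.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<4h", 0, 100, -100, 0))
floats = wav_to_floats(buf.getvalue())
print(floats)  # [0.0, 0.0030517578125, -0.0030517578125, 0.0]
```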
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from huxxx657)
author: John Snow Labs
name: roberta_qa_base_finetuned_scrambled_squad_15
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-15` is an English model originally trained by `huxxx657`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_15_en_4.3.0_3.0_1674216826395.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_15_en_4.3.0_3.0_1674216826395.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_15","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_15","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_finetuned_scrambled_squad_15|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|464.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-15
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_42
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_42_en_4.3.0_3.0_1674213511887.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_42_en_4.3.0_3.0_1674213511887.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_42","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_42","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_42|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|447.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-42
---
layout: model
title: Word Segmenter for Chinese
author: John Snow Labs
name: wordseg_pku
date: 2021-03-09
tags: [word_segmentation, open_source, chinese, wordseg_pku, zh]
task: Word Segmentation
language: zh
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: WordSegmenterModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
[WordSegmenterModel (WSM)](https://en.wikipedia.org/wiki/Text_segmentation) uses a maximum entropy probability model to detect word boundaries in Chinese text.
Chinese text is written without white space between words, so a computer-based application cannot know a priori which sequence of ideograms forms a word.
Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.
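To make the boundary problem concrete, here is a toy dictionary-based maximum-matching segmenter in plain Python. This is only an illustration of why segmentation is needed; `wordseg_pku` itself uses a maximum entropy model, not a dictionary lookup.

```python
# Toy longest-match-first segmenter: at each position, take the longest
# dictionary entry; fall back to a single character when nothing matches.
VOCAB = {"你好", "你", "好", "从"}  # tiny illustrative dictionary

def max_match(text, vocab, max_len=4):
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:  # unknown character: emit it on its own
            tokens.append(text[i])
            i += 1
    return tokens

print(max_match("从你好", VOCAB))  # ['从', '你好']
```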
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/chinese/word_segmentation/words_segmenter_demo.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_pku_zh_3.0.0_3.0_1615292332841.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_pku_zh_3.0.0_3.0_1615292332841.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
word_segmenter = WordSegmenterModel.pretrained("wordseg_pku", "zh") \
.setInputCols(["document"]) \
.setOutputCol("token")
pipeline = Pipeline(stages=[document_assembler, word_segmenter])
ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
example = spark.createDataFrame([['从John Snow Labs你好! ']], ["text"])
result = ws_model.transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_pku", "zh")
.setInputCols(Array("document"))
.setOutputCol("token")
val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter))
val data = Seq("从John Snow Labs你好! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["从John Snow Labs你好! "]
token_df = nlu.load('zh.segment_words.pku').predict(text)
token_df
```
## Results
```bash
0 从
1 Jo
2 hn
3 Sn
4 ow
5 La
6 bs
7 你
8 好
9 !
Name: token, dtype: object
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|wordseg_pku|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[words_segmented]|
|Language:|zh|
---
layout: model
title: English BertForQuestionAnswering model (from aodiniz)
author: John Snow Labs
name: bert_qa_bert_uncased_L_2_H_512_A_8_cord19_200616_squad2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-2_H-512_A-8_cord19-200616_squad2` is an English model originally trained by `aodiniz`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_2_H_512_A_8_cord19_200616_squad2_en_4.0.0_3.0_1654185223441.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_2_H_512_A_8_cord19_200616_squad2_en_4.0.0_3.0_1654185223441.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_2_H_512_A_8_cord19_200616_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_uncased_L_2_H_512_A_8_cord19_200616_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_cord19.bert.uncased_2l_512d_a8a_512d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_uncased_L_2_H_512_A_8_cord19_200616_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|83.4 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/aodiniz/bert_uncased_L-2_H-512_A-8_cord19-200616_squad2
---
layout: model
title: Italian T5ForConditionalGeneration Small Cased model (from it5)
author: John Snow Labs
name: t5_it5_efficient_small_el32_informal_to_formal
date: 2023-01-30
tags: [it, open_source, t5, tensorflow]
task: Text Generation
language: it
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-efficient-small-el32-informal-to-formal` is an Italian model originally trained by `it5`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_informal_to_formal_it_4.3.0_3.0_1675103414416.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_informal_to_formal_it_4.3.0_3.0_1675103414416.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_informal_to_formal","it") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_informal_to_formal","it")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_it5_efficient_small_el32_informal_to_formal|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|it|
|Size:|593.5 MB|
## References
- https://huggingface.co/it5/it5-efficient-small-el32-informal-to-formal
- https://github.com/stefan-it
- https://arxiv.org/abs/2203.03759
- https://gsarti.com
- https://malvinanissim.github.io
- https://arxiv.org/abs/2109.10686
- https://github.com/gsarti/it5
- https://paperswithcode.com/sota?task=Informal-to-formal+Style+Transfer&dataset=XFORMAL+%28Italian+Subset%29
---
layout: model
title: Legal Powers Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_powers_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, powers, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Powers` clause type. To use it, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
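As a minimal, plain-Python sketch of the first technique above (paragraph splitting by multiline), independent of the Spark NLP annotators the workshop uses; the sample text is illustrative:

```python
import re

def split_paragraphs(text):
    # Split on one or more blank lines and drop empty chunks.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "WHEREAS, the parties agree as follows.\n\nSection 1. Powers.\nThe Board may act.\n\nSection 2. Notices."
print(len(split_paragraphs(doc)))  # 3
```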
Take into consideration that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the clause models you add.
## Predicted Entities
`Powers`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_powers_bert_en_1.0.0_3.0_1678050727123.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_powers_bert_en_1.0.0_3.0_1678050727123.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
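This card ships without a snippet, so the following is a hedged sketch only, following the usual pattern of Legal NLP clause classifiers; the sentence-embeddings model name (`sent_bert_base_cased`) and the `nlp`/`legal` module aliases are assumptions, not confirmed for this model.

```python
# Hypothetical pipeline, assuming the standard Legal NLP clause-classifier layout
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_powers_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])
df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```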
## Results
```bash
+--------+
|  result|
+--------+
|[Powers]|
| [Other]|
| [Other]|
|[Powers]|
+--------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_powers_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.3 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 0.90 0.95 0.92 19
Powers 0.90 0.82 0.86 11
accuracy - - 0.90 30
macro-avg 0.90 0.88 0.89 30
weighted-avg 0.90 0.90 0.90 30
```
---
layout: model
title: Korean BertForMaskedLM Base Cased model (from kykim)
author: John Snow Labs
name: bert_embeddings_kor_base
date: 2022-12-02
tags: [ko, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: ko
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-kor-base` is a Korean model originally trained by `kykim`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_kor_base_ko_4.2.4_3.0_1670019610924.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_kor_base_ko_4.2.4_3.0_1670019610924.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_kor_base","ko") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_kor_base","ko")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_kor_base|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ko|
|Size:|443.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/kykim/bert-kor-base
- https://github.com/kiyoungkim1/LM-kor
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from moghis)
author: John Snow Labs
name: xlmroberta_ner_base_finetuned_panx_de_data
date: 2022-08-14
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de-data` is a German model originally trained by `moghis`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_panx_de_data_de_4.1.0_3.0_1660438168971.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_panx_de_data_de_4.1.0_3.0_1660438168971.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_panx_de_data","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_panx_de_data","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_finetuned_panx_de_data|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/moghis/xlm-roberta-base-finetuned-panx-de-data
---
layout: model
title: French CamemBert Embeddings (from kaushikacharya)
author: John Snow Labs
name: camembert_embeddings_kaushikacharya_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `kaushikacharya`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_kaushikacharya_generic_model_fr_3.4.4_3.0_1653989224371.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_kaushikacharya_generic_model_fr_3.4.4_3.0_1653989224371.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_kaushikacharya_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_kaushikacharya_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
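A common downstream use of these token embeddings is measuring semantic similarity. The helper below is a minimal, framework-free sketch of that step; the 4-dimensional vectors are made-up illustrative values (real CamemBERT embeddings are 768-dimensional and come from the `embeddings` column of the result DataFrame):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors only; not real model outputs.
v1 = [0.1, 0.3, -0.2, 0.5]
v2 = [0.1, 0.3, -0.2, 0.5]
v3 = [-0.5, 0.2, 0.1, -0.3]

print(cosine_similarity(v1, v2))  # identical vectors -> 1.0
```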
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_kaushikacharya_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/kaushikacharya/dummy-model
---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_10
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_10_en_4.0.0_3.0_1655731588944.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_10_en_4.0.0_3.0_1655731588944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_10","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_10","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_seed_10").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
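`fullAnnotate` returns one dict per input row, mapping output columns to lists of annotation objects; the exact structure varies across Spark NLP versions, so the sketch below operates on a simplified mock of that shape rather than a live result:

```python
# Mock of a fullAnnotate-style row: the "answer" column holds annotations
# whose "result" field carries the predicted answer span.
mock_result = [{
    "answer": [{"annotatorType": "chunk", "result": "Clara", "begin": 11, "end": 15}]
}]

def extract_answers(rows, column="answer"):
    """Collect the answer strings from fullAnnotate-style rows."""
    return [ann["result"] for row in rows for ann in row.get(column, [])]

print(extract_answers(mock_result))  # ['Clara']
```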
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_10|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|416.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-10
---
layout: model
title: Pipeline to Resolve Medication Codes
author: John Snow Labs
name: medication_resolver_pipeline
date: 2023-04-10
tags: [resolver, snomed, umls, rxnorm, ndc, ade, en, licensed, pipeline]
task: Entity Resolution
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A pretrained resolver pipeline to extract medications and resolve their adverse reactions (ADE), RxNorm, UMLS, NDC, SNOMED CT codes, and action/treatments in clinical text.
Action/treatments are available for branded medication, and SNOMED codes are available for non-branded medication.
This pipeline can be used as a LightPipeline (with `annotate`/`fullAnnotate`). For Spark `transform`, use `medication_resolver_transform_pipeline` instead.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/medication_resolver_pipeline_en_4.3.2_3.0_1681151954032.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/medication_resolver_pipeline_en_4.3.2_3.0_1681151954032.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
med_resolver_pipeline = PretrainedPipeline("medication_resolver_pipeline", "en", "clinical/models")
text = """The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet."""
result = med_resolver_pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val med_resolver_pipeline = new PretrainedPipeline("medication_resolver_pipeline", "en", "clinical/models")
val result = med_resolver_pipeline.fullAnnotate("""The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.medication").predict("""The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""")
```
## Results
```bash
| | chunks | entities | ADE | RxNorm | Action | Treatment | UMLS | SNOMED_CT | NDC_Product | NDC_Package |
|---:|:-----------------------------|:-----------|:----------------------------|---------:|:---------------------------|:-------------------------------------------|:---------|:------------|:--------------|:--------------|
| 0 | Amlodopine Vallarta 10-320mg | DRUG | Gynaecomastia | 722131 | NONE | NONE | C1949334 | 425838008 | 00093-7693 | 00093-7693-56 |
| 1 | Eviplera | DRUG | Anxiety | 217010 | Inhibitory Bone Resorption | Osteoporosis | C0720318 | NONE | NONE | NONE |
| 2 | Lescol 40 MG | DRUG | NONE | 103919 | Hypocholesterolemic | Heterozygous Familial Hypercholesterolemia | C0353573 | NONE | 00078-0234 | 00078-0234-05 |
| 3 | Everolimus 1.5 mg tablet | DRUG | Acute myocardial infarction | 2056895 | NONE | NONE | C4723581 | NONE | 00054-0604 | 00054-0604-21 |
```
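When post-processing rows like the table above, the literal string `NONE` usually needs to be mapped to a proper missing value before analysis. A small hedged helper (the keys and values below are copied from the result table; the dict shape is an assumption for illustration):

```python
def normalize_row(row):
    """Replace literal 'NONE' strings with Python None in a result row."""
    return {k: (None if v == "NONE" else v) for k, v in row.items()}

# Values taken from the Eviplera row of the results table above.
row = {"chunks": "Eviplera", "ADE": "Anxiety", "SNOMED_CT": "NONE", "NDC_Product": "NONE"}
print(normalize_row(row))
```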
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|medication_resolver_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|3.2 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
- TextMatcherModel
- ChunkMergeModel
- ChunkMapperModel
- ChunkMapperModel
- ChunkMapperFilterer
- Chunk2Doc
- BertSentenceEmbeddings
- SentenceEntityResolverModel
- ResolverMerger
- ResolverMerger
- ChunkMapperModel
- ChunkMapperModel
- ChunkMapperModel
- ChunkMapperModel
- ChunkMapperModel
- ChunkMapperModel
- Finisher
---
layout: model
title: Fast Neural Machine Translation Model from Arabic to Esperanto
author: John Snow Labs
name: opus_mt_ar_eo
date: 2021-06-01
tags: [open_source, seq2seq, translation, ar, eo, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
source languages: ar
target languages: eo
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ar_eo_xx_3.1.0_2.4_1622554268155.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ar_eo_xx_3.1.0_2.4_1622554268155.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_ar_eo", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
data = spark.createDataFrame([["مرحبا بالعالم"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_ar_eo", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Arabic.translate_to.Esperanto').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_ar_eo|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Emotional Stressor Classifier (BERT)
author: John Snow Labs
name: bert_sequence_classifier_stressor
date: 2022-07-27
tags: [stressor, public_health, en, licensed, sequence_classification]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a [bioBERT](https://nlp.johnsnowlabs.com/2022/07/18/biobert_pubmed_base_cased_v1.2_en_3_0.html) based classifier that can classify the source of emotional stress in text.
## Predicted Entities
`Family_Issues`, `Financial_Problem`, `Health_Fatigue_or_Physical Pain`, `Other`, `School`, `Work`, `Social_Relationships`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_STRESS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_stressor_en_4.0.0_3.0_1658923809554.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_stressor_en_4.0.0_3.0_1658923809554.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_stressor", "en", "clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
data = spark.createDataFrame([["All the panic about the global pandemic has been stressing me out!"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_stressor", "en", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))
val data = Seq("All the panic about the global pandemic has been stressing me out!").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.stressor").predict("""All the panic about the global pandemic has been stressing me out!""")
```
## Results
```bash
+------------------------------------------------------------------+-----------------------------------+
|text |class |
+------------------------------------------------------------------+-----------------------------------+
|All the panic about the global pandemic has been stressing me out!|[Health, Fatigue, or Physical Pain]|
+------------------------------------------------------------------+-----------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_stressor|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## Benchmarking
```bash
label precision recall f1-score support
Family Issues 0.80 0.87 0.84 161
Financial Problem 0.87 0.83 0.85 126
Health, Fatigue, or Physical Pain 0.75 0.81 0.78 168
Other 0.82 0.80 0.81 384
School 0.89 0.91 0.90 127
Social Relationships 0.83 0.71 0.76 133
Work 0.87 0.89 0.88 271
accuracy - - 0.83 1370
macro-avg 0.83 0.83 0.83 1370
weighted-avg 0.83 0.83 0.83 1370
```
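The macro and weighted averages in the benchmarking table can be reproduced from the per-class rows. The quick check below recomputes both from the F1 scores and supports listed above:

```python
# Per-class (f1, support) pairs copied from the benchmarking table above.
rows = [
    (0.84, 161),  # Family Issues
    (0.85, 126),  # Financial Problem
    (0.78, 168),  # Health, Fatigue, or Physical Pain
    (0.81, 384),  # Other
    (0.90, 127),  # School
    (0.76, 133),  # Social Relationships
    (0.88, 271),  # Work
]

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1 for f1, _ in rows) / len(rows)
# Weighted average: mean weighted by class support.
weighted_f1 = sum(f1 * n for f1, n in rows) / sum(n for _, n in rows)

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.83 0.83
```

Both round to the 0.83 reported in the table, and the supports sum to the stated 1370 examples.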
---
layout: model
title: Question classification of open-domain and fact-based questions Pipeline - TREC6
author: John Snow Labs
name: classifierdl_use_trec6_pipeline
date: 2021-01-08
task: [Text Classification, Pipeline Public]
language: en
nav_key: models
edition: Spark NLP 2.7.1
spark_version: 2.4
tags: [classifier, text_classification, en, open_source, pipeline]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Classify open-domain, fact-based questions into one of the following broad semantic categories: Abbreviation, Description, Entities, Human Beings, Locations or Numeric Values.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_EN_TREC/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_EN_TREC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec6_pipeline_en_2.7.1_2.4_1610119335714.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec6_pipeline_en_2.7.1_2.4_1610119335714.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPython.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("classifierdl_use_trec6_pipeline", lang = "en")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("classifierdl_use_trec6_pipeline", lang = "en")
```
## Results
```bash
+------------------------------------------------------------------------------------------------+------------+
|document |class |
+------------------------------------------------------------------------------------------------+------------+
|When did the construction of stone circles begin in the UK? | NUM |
+------------------------------------------------------------------------------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|classifierdl_use_trec6_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.1+|
|Edition:|Official|
|Language:|en|
---
layout: model
title: Translate English to Luo (Kenya and Tanzania) Pipeline
author: John Snow Labs
name: translate_en_luo
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, luo, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `luo`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_luo_xx_2.7.0_2.4_1609689219424.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_luo_xx_2.7.0_2.4_1609689219424.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_luo", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_luo", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.luo').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_luo|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English asr_wav2vec2_large_xlsr_53_gpt TFWav2Vec2ForCTC from voidful
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_53_gpt
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_gpt` is an English model originally trained by voidful.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_gpt_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_gpt_en_4.2.0_3.0_1664095296861.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_gpt_en_4.2.0_3.0_1664095296861.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_gpt', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_gpt", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_gpt|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.3 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Abkhazian asr_hf_challenge_test TFWav2Vec2ForCTC from Iskaj
author: John Snow Labs
name: asr_hf_challenge_test
date: 2022-09-24
tags: [wav2vec2, ab, audio, open_source, asr]
task: Automatic Speech Recognition
language: ab
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_hf_challenge_test` is an Abkhazian model originally trained by Iskaj.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_hf_challenge_test_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_hf_challenge_test_ab_4.2.0_3.0_1664021278980.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_hf_challenge_test_ab_4.2.0_3.0_1664021278980.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_hf_challenge_test", "ab")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_hf_challenge_test", "ab")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_hf_challenge_test|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|ab|
|Size:|446.6 KB|
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from arjunth2001)
author: John Snow Labs
name: roberta_qa_priv_qna
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `priv_qna` is an English model originally trained by `arjunth2001`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_priv_qna_en_4.3.0_3.0_1674211774365.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_priv_qna_en_4.3.0_3.0_1674211774365.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_priv_qna","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_priv_qna","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_priv_qna|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.6 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/arjunth2001/priv_qna
---
layout: model
title: English BertForQuestionAnswering Small Cased model (from motiondew)
author: John Snow Labs
name: bert_qa_sd2_small
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-sd2-small` is an English model originally trained by `motiondew`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sd2_small_en_4.0.0_3.0_1657188173526.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sd2_small_en_4.0.0_3.0_1657188173526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd2_small","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd2_small","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_sd2_small|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/motiondew/bert-sd2-small
---
layout: model
title: Translate Niger-Kordofanian languages to English Pipeline
author: John Snow Labs
name: translate_nic_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, nic, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `nic`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_nic_en_xx_2.7.0_2.4_1609699199544.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_nic_en_xx_2.7.0_2.4_1609699199544.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_nic_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_nic_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.nic.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_nic_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Pipeline to Extract Biomarkers and Their Results
author: John Snow Labs
name: ner_oncology_biomarker_healthcare_pipeline
date: 2023-03-08
tags: [licensed, clinical, oncology, en, ner, biomarker]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_oncology_biomarker_healthcare](https://nlp.johnsnowlabs.com/2023/01/11/ner_oncology_biomarker_healthcare_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_biomarker_healthcare_pipeline_en_4.3.0_3.2_1678269721297.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_biomarker_healthcare_pipeline_en_4.3.0_3.2_1678269721297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_oncology_biomarker_healthcare_pipeline", "en", "clinical/models")
text = '''The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_oncology_biomarker_healthcare_pipeline", "en", "clinical/models")
val text = "The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%."
val result = pipeline.fullAnnotate(text)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_finetuned_squad2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_finetuned_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_large_finetuned_squad2|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/phiyodr/roberta-large-finetuned-squad2
- https://rajpurkar.github.io/SQuAD-explorer/
- https://arxiv.org/abs/1907.11692
- https://arxiv.org/abs/1806.03822
- https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
- https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from sasuke)
author: John Snow Labs
name: distilbert_qa_base_uncased_finetuned_squad1
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad1` is an English model originally trained by `sasuke`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad1_en_4.3.0_3.0_1672773467636.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad1_en_4.3.0_3.0_1672773467636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad1","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_finetuned_squad1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/sasuke/distilbert-base-uncased-finetuned-squad1
---
layout: model
title: Legal Warrant Agreement Document Binary Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_warrant_agreement_bert
date: 2022-12-18
tags: [en, legal, classification, licensed, document, bert, warrant, agreement, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_warrant_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the `warrant-agreement` class or not (binary classification).
Unlike the Longformer model, this model is lighter in terms of inference time.
## Predicted Entities
`warrant-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_warrant_agreement_bert_en_1.0.0_3.0_1671393844438.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_warrant_agreement_bert_en_1.0.0_3.0_1671393844438.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
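The usage snippet is missing from this card. Below is a minimal sketch of a typical Legal NLP document-classification pipeline, assuming the standard `DocumentAssembler` → `BertSentenceEmbeddings` → `ClassifierDLModel` stages; the specific sentence-embeddings model name shown is an assumption, not confirmed by this card.

```python
# Sketch only: requires a running Spark NLP for Legal session and a licensed environment.
# The embeddings model name below is an assumption.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_warrant_agreement_bert", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

data = spark.createDataFrame([["YOUR LEGAL DOCUMENT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```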
## Results
```bash
+-------------------+
|result             |
+-------------------+
|[warrant-agreement]|
|[other]            |
|[other]            |
|[warrant-agreement]|
+-------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_warrant_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents.
## Benchmarking
```bash
label precision recall f1-score support
other 0.98 0.99 0.98 204
warrant-agreement 0.96 0.95 0.96 83
accuracy - - 0.98 287
macro-avg 0.97 0.97 0.97 287
weighted-avg 0.98 0.98 0.98 287
```
---
layout: model
title: Pipeline to Detect Anatomical Regions (MedicalBertForTokenClassifier)
author: John Snow Labs
name: bert_token_classifier_ner_anatomy_pipeline
date: 2023-03-20
tags: [anatomy, bertfortokenclassification, ner, en, licensed]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [bert_token_classifier_ner_anatomy](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_anatomy_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatomy_pipeline_en_4.3.0_3.2_1679306174114.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatomy_pipeline_en_4.3.0_3.2_1679306174114.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_ner_anatomy_pipeline", "en", "clinical/models")
text = '''This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.
General: Well-developed female, in no acute distress, afebrile.
HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.
Neck: No lymphadenopathy.
Chest: Clear.
Abdomen: Positive bowel sounds and soft.
Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_ner_anatomy_pipeline", "en", "clinical/models")
val text = "This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.
General: Well-developed female, in no acute distress, afebrile.
HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.
Neck: No lymphadenopathy.
Chest: Clear.
Abdomen: Positive bowel sounds and soft.
Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.anatomy_pipeline").predict("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.
General: Well-developed female, in no acute distress, afebrile.
HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.
Neck: No lymphadenopathy.
Chest: Clear.
Abdomen: Positive bowel sounds and soft.
Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_luo_finetuned_ner","luo") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_luo_finetuned_ner","luo")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_finetuned_luo_finetuned_ner|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|luo|
|Size:|1.0 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-luo-finetuned-ner-luo
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://github.com/masakhane-io/masakhane-ner
---
layout: model
title: English image_classifier_vit_cifar10 ViTForImageClassification from alfredcs
author: John Snow Labs
name: image_classifier_vit_cifar10
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_cifar10` is an English model originally trained by alfredcs.
## Predicted Entities
`deer`, `bird`, `dog`, `horse`, `automobile`, `truck`, `frog`, `ship`, `airplane`, `cat`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_cifar10_en_4.1.0_3.0_1660167465918.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_cifar10_en_4.1.0_3.0_1660167465918.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_cifar10", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_cifar10", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_cifar10|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: English DistilBertForQuestionAnswering Base Cased model (from ncduy)
author: John Snow Labs
name: distilbert_qa_base_cased_led_squad_finetuned_test
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-finetuned-squad-test` is an English model originally trained by `ncduy`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_finetuned_test_en_4.3.0_3.0_1672766662366.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_finetuned_test_en_4.3.0_3.0_1672766662366.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_finetuned_test","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_finetuned_test","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_cased_led_squad_finetuned_test|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ncduy/distilbert-base-cased-distilled-squad-finetuned-squad-test
---
layout: model
title: Indonesian Part of Speech Tagger (from w11wo)
author: John Snow Labs
name: roberta_pos_indonesian_roberta_base_posp_tagger
date: 2022-05-03
tags: [roberta, pos, part_of_speech, id, open_source]
task: Part of Speech Tagging
language: id
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indonesian-roberta-base-posp-tagger` is an Indonesian model originally trained by `w11wo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_pos_indonesian_roberta_base_posp_tagger_id_3.4.2_3.0_1651596272433.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_pos_indonesian_roberta_base_posp_tagger_id_3.4.2_3.0_1651596272433.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_indonesian_roberta_base_posp_tagger","id") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Saya suka Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_indonesian_roberta_base_posp_tagger","id")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Saya suka Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("id.pos.indonesian_roberta_base_posp_tagger").predict("""Saya suka Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_pos_indonesian_roberta_base_posp_tagger|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|id|
|Size:|466.1 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/w11wo/indonesian-roberta-base-posp-tagger
- https://arxiv.org/abs/1907.11692
- https://hf.co/flax-community/indonesian-roberta-base
- https://hf.co/datasets/indonlu
- https://w11wo.github.io/
---
layout: model
title: Multilingual Representations for Indian Languages (MuRIL) - BERT Sentence Embedding pre-trained on 17 Indian languages
author: John Snow Labs
name: sent_bert_muril
date: 2021-09-01
tags: [xx, open_source, sentence_embeddings, muril, indian_languages]
task: Embeddings
language: xx
edition: Spark NLP 3.2.0
spark_version: 3.0
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model uses a BERT base architecture pretrained from scratch using the Wikipedia, Common Crawl, PMINDIA and Dakshina corpora for the following 17 Indian languages:
`Assamese`, `Bengali`, `English`, `Gujarati`, `Hindi`, `Kannada`, `Kashmiri`, `Malayalam`, `Marathi`, `Nepali`, `Oriya`, `Punjabi`, `Sanskrit`, `Sindhi`, `Tamil`, `Telugu`, `Urdu`
The MuRIL model is pre-trained on monolingual segments as well as parallel segments, as detailed below:
- Monolingual data: publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.
- Parallel data: there are two types of parallel data:
  - Translated data: translations of the above monolingual corpora, obtained using the Google NMT pipeline and fed as translated segment pairs. The publicly available PMINDIA corpus was also used.
  - Transliterated data: transliterations of Wikipedia, obtained using the IndicTrans library and fed as transliterated segment pairs. The publicly available Dakshina dataset was also used.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_muril_xx_3.2.0_3.0_1630467991919.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_muril_xx_3.2.0_3.0_1630467991919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_muril", "xx") \
.setInputCols("sentence") \
.setOutputCol("bert_sentence")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings ])
```
```scala
val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_muril", "xx")
.setInputCols("sentence")
.setOutputCol("bert_sentence")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings ))
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
sent_embeddings_df = nlu.load('en.embed_sentence.bert.muril').predict(text, output_level='sentence')
sent_embeddings_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_bert_muril|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[bert_sentence]|
|Language:|xx|
|Case sensitive:|false|
## Data Source
[1]: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.
[2]: [Wikipedia](https://www.tensorflow.org/datasets/catalog/wikipedia)
[3]: [Common Crawl](http://commoncrawl.org/the-data/)
[4]: [PMINDIA](http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/index.html)
[5]: [Dakshina](https://github.com/google-research-datasets/dakshina)
The model is imported from: https://tfhub.dev/google/MuRIL/1
---
layout: model
title: Sentence Detection in Punjabi Text
author: John Snow Labs
name: sentence_detector_dl
date: 2021-08-30
tags: [pa, open_source, sentence_detection]
task: Sentence Detection
language: pa
edition: Spark NLP 3.2.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_pa_3.2.0_3.0_1630320087911.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_pa_3.2.0_3.0_1630320087911.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
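The usage snippet is missing here; the sketch below follows the pattern used elsewhere in these cards, feeding a `DocumentAssembler` into `SentenceDetectorDLModel`, with the Punjabi model name taken from the Model Information table of this card.

```python
# Sketch only: requires a running Spark NLP session.
documenter = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "pa") \
.setInputCols(["document"]) \
.setOutputCol("sentences")

pipeline = Pipeline(stages=[documenter, sentencerDL])

data = spark.createDataFrame([["PUNJABI TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```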
## Results
```bash
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ਅੰਗਰੇਜ਼ੀ ਪੜ੍ਹਨ ਦੇ ਪੈਰਾਗ੍ਰਾਫਾਂ ਦੇ ਇੱਕ ਮਹਾਨ ਸਰੋਤ ਦੀ ਭਾਲ ਕਰ ਰਹੇ ਹੋ?] |
|[ਤੁਸੀਂ ਸਹੀ ਜਗ੍ਹਾ ਤੇ ਆਏ ਹੋ.] |
|[ਇੱਕ ਤਾਜ਼ਾ ਅਧਿਐਨ ਅਨੁਸਾਰ ਅੱਜ ਦੇ ਨੌਜਵਾਨਾਂ ਵਿੱਚ ਪੜ੍ਹਨ ਦੀ ਆਦਤ ਤੇਜ਼ੀ ਨਾਲ ਘਟ ਰਹੀ ਹੈ। ਉਹ ਕੁਝ ਸਕਿੰਟਾਂ ਤੋਂ ਵੱਧ ਸਮੇਂ ਲਈ ਦਿੱਤੇ ਗਏ ਅੰਗਰੇਜ਼ੀ ਪੜ੍ਹਨ ਵਾਲੇ ਪੈਰੇ 'ਤੇ ਧਿਆਨ ਨਹੀਂ ਦੇ ਸਕਦੇ!]|
|[ਨਾਲ ਹੀ, ਪੜ੍ਹਨਾ ਸਾਰੀਆਂ ਪ੍ਰਤੀਯੋਗੀ ਪ੍ਰੀਖਿਆਵਾਂ ਦਾ ਇੱਕ ਅਨਿੱਖੜਵਾਂ ਅੰਗ ਸੀ ਅਤੇ ਹੈ.] |
|[ਇਸ ਲਈ, ਤੁਸੀਂ ਆਪਣੇ ਪੜ੍ਹਨ ਦੇ ਹੁਨਰ ਨੂੰ ਕਿਵੇਂ ਸੁਧਾਰਦੇ ਹੋ?] |
|[ਇਸ ਪ੍ਰਸ਼ਨ ਦਾ ਉੱਤਰ ਅਸਲ ਵਿੱਚ ਇੱਕ ਹੋਰ ਪ੍ਰਸ਼ਨ ਹੈ:] |
|[ਪੜ੍ਹਨ ਦੇ ਹੁਨਰ ਦੀ ਵਰਤੋਂ ਕੀ ਹੈ?] |
|[ਪੜ੍ਹਨ ਦਾ ਮੁੱਖ ਉਦੇਸ਼ 'ਅਰਥ ਬਣਾਉਣਾ' ਹੈ.] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sentence_detector_dl|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[sentences]|
|Language:|pa|
## Benchmarking
```bash
label Accuracy Recall Prec F1
0 0.98 1.00 0.96 0.98
```
---
layout: model
title: English asr_wav2vec2_large_robust_libri_960h TFWav2Vec2ForCTC from facebook
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_robust_libri_960h
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_robust_libri_960h` is an English model originally trained by facebook.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_robust_libri_960h_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_robust_libri_960h_en_4.2.0_3.0_1664039514827.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_robust_libri_960h_en_4.2.0_3.0_1664039514827.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_robust_libri_960h', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_robust_libri_960h", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_robust_libri_960h|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|757.6 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English asr_wav2vec2_xls_r_300m_german_english TFWav2Vec2ForCTC from aware-ai
author: John Snow Labs
name: pipeline_asr_wav2vec2_xls_r_300m_german_english
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_german_english` is an English model originally trained by aware-ai.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use `pipeline_asr_wav2vec2_xls_r_300m_german_english_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_german_english_en_4.2.0_3.0_1664111961756.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_german_english_en_4.2.0_3.0_1664111961756.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_german_english', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_german_english", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xls_r_300m_german_english|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Clean patterns pipeline for English
author: John Snow Labs
name: clean_pattern
date: 2022-07-06
tags: [open_source, english, clean_pattern, pipeline, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `clean_pattern` pipeline is a pretrained pipeline that processes text with basic processing steps and recognizes entities.
It performs most of the common text processing tasks on your DataFrame.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/clean_pattern_en_4.0.0_3.0_1657137560119.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/clean_pattern_en_4.0.0_3.0_1657137560119.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('clean_pattern', lang = 'en')
annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("clean_pattern", lang = "en")
val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs ! "]
result_df = nlu.load('en.clean.pattern').predict(text)
result_df
```
## Results
```bash
|    | document   | sentence   | token     | normal    |
|---:|:-----------|:-----------|:----------|:----------|
|  0 | ['Hello']  | ['Hello']  | ['Hello'] | ['Hello'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|clean_pattern|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|28.8 KB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- NormalizerModel
---
layout: model
title: Bulgarian RobertaForMaskedLM Base Cased model (from iarfmoose)
author: John Snow Labs
name: roberta_embeddings_base_bulgarian
date: 2022-12-12
tags: [bg, open_source, roberta_embeddings, robertaformaskedlm]
task: Embeddings
language: bg
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bulgarian` is a Bulgarian model originally trained by `iarfmoose`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_base_bulgarian_bg_4.2.4_3.0_1670859176755.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_base_bulgarian_bg_4.2.4_3.0_1670859176755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_base_bulgarian","bg") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_base_bulgarian","bg")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_base_bulgarian|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|bg|
|Size:|473.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/iarfmoose/roberta-base-bulgarian
- https://arxiv.org/abs/1907.11692
- https://oscar-corpus.com/
- https://wortschatz.uni-leipzig.de/en/download/bulgarian
---
layout: model
title: Pipeline to Detect Clinical Conditions (ner_eu_clinical_case - eu)
author: John Snow Labs
name: ner_eu_clinical_condition_pipeline
date: 2023-03-07
tags: [eu, clinical, licensed, ner, clinical_condition]
task: Named Entity Recognition
language: eu
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_eu_clinical_condition](https://nlp.johnsnowlabs.com/2023/02/06/ner_eu_clinical_condition_eu.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_pipeline_eu_4.3.0_3.2_1678213509285.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_pipeline_eu_4.3.0_3.2_1678213509285.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_eu_clinical_condition_pipeline", "eu", "clinical/models")
text = """Gertaera honetatik bi hilabetetara, umea Larrialdietako Zerbitzura dator 4 egunetan zehar buruko mina eta bekokiko hantura azaltzeagatik, sukarrik izan gabe. Miaketan, haztapen mingarria duen bekokiko hantura bigunaz gain, ez da beste zeinurik azaltzen. Polakiuria eta tenesmo arina ere izan zuen egun horretan hematuriarekin batera. Geroztik sintomarik gabe dago."""
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_eu_clinical_condition_pipeline", "eu", "clinical/models")
val text = """Gertaera honetatik bi hilabetetara, umea Larrialdietako Zerbitzura dator 4 egunetan zehar buruko mina eta bekokiko hantura azaltzeagatik, sukarrik izan gabe. Miaketan, haztapen mingarria duen bekokiko hantura bigunaz gain, ez da beste zeinurik azaltzen. Polakiuria eta tenesmo arina ere izan zuen egun horretan hematuriarekin batera. Geroztik sintomarik gabe dago."""
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | chunks | begin | end | entities | confidence |
|---:|:-----------|--------:|------:|:-------------------|-------------:|
| 0 | mina | 98 | 101 | clinical_condition | 0.8754 |
| 1 | hantura | 116 | 122 | clinical_condition | 0.8877 |
| 2 | sukarrik | 139 | 146 | clinical_condition | 0.9119 |
| 3 | mingarria | 178 | 186 | clinical_condition | 0.7381 |
| 4 | hantura | 203 | 209 | clinical_condition | 0.8805 |
| 5 | Polakiuria | 256 | 265 | clinical_condition | 0.6683 |
| 6 | sintomarik | 345 | 354 | clinical_condition | 0.9632 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_eu_clinical_condition_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|eu|
|Size:|1.1 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: English image_classifier_vit_anomaly ViTForImageClassification from hafidber
author: John Snow Labs
name: image_classifier_vit_anomaly
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_anomaly` is an English model originally trained by hafidber.
## Predicted Entities
`abnormal`, `normal`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_anomaly_en_4.1.0_3.0_1660169901656.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_anomaly_en_4.1.0_3.0_1660169901656.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_anomaly", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_anomaly", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_anomaly|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Legal Supplemental Indenture Document Binary Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_supplemental_indenture_agreement_bert
date: 2022-12-18
tags: [en, legal, classification, licensed, document, bert, supplemental, indenture, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_supplemental_indenture_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `supplemental-indenture` or not (Binary Classification).
Compared with the Longformer-based model, this model is lighter and faster at inference time.
## Predicted Entities
`supplemental-indenture`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_supplemental_indenture_agreement_bert_en_1.0.0_3.0_1671393857190.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_supplemental_indenture_agreement_bert_en_1.0.0_3.0_1671393857190.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
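This card is missing its usual usage snippet, so below is a minimal sketch following the standard Legal NLP document-classification layout used by similar `legclf_*` cards. The `sent_bert_base_cased` embeddings model and the `"class"` output column are assumptions based on those sibling cards, not confirmed details for this specific model.

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed sentence-embeddings backbone; sibling legclf_* cards use sent_bert_base_cased
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = ClassifierDLModel.pretrained("legclf_supplemental_indenture_agreement_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR LEGAL DOCUMENT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```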
## Results
```bash
+------------------------+
|result                  |
+------------------------+
|[supplemental-indenture]|
|[other]                 |
|[other]                 |
|[supplemental-indenture]|
+------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_supplemental_indenture_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents.
## Benchmarking
```bash
label precision recall f1-score support
other 0.97 0.95 0.96 204
supplemental-indenture 0.91 0.95 0.93 111
accuracy - - 0.95 315
macro-avg 0.94 0.95 0.95 315
weighted-avg 0.95 0.95 0.95 315
```
---
layout: model
title: English image_classifier_vit_deit_flyswot ViTForImageClassification from davanstrien
author: John Snow Labs
name: image_classifier_vit_deit_flyswot
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_deit_flyswot` is an English model originally trained by davanstrien.
## Predicted Entities
`EDGE + SPINE`, `OTHER`, `PAGE + FOLIO`, `FLYSHEET`, `CONTAINER`, `CONTROL SHOT`, `COVER`, `SCROLL`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_deit_flyswot_en_4.1.0_3.0_1660166402706.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_deit_flyswot_en_4.1.0_3.0_1660166402706.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_deit_flyswot", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_deit_flyswot", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_deit_flyswot|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.8 MB|
---
layout: model
title: English image_classifier_vit_llama_alpaca_snake ViTForImageClassification from osanseviero
author: John Snow Labs
name: image_classifier_vit_llama_alpaca_snake
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_llama_alpaca_snake` is an English model originally trained by osanseviero.
## Predicted Entities
`alpaca`, `llamas`, `snake`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_llama_alpaca_snake_en_4.1.0_3.0_1660170191761.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_llama_alpaca_snake_en_4.1.0_3.0_1660170191761.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_llama_alpaca_snake", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_llama_alpaca_snake", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_llama_alpaca_snake|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab66 TFWav2Vec2ForCTC from hassnain
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab66
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab66` is an English model originally trained by hassnain.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_wav2vec2_base_timit_demo_colab66_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab66_en_4.2.0_3.0_1664024830506.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab66_en_4.2.0_3.0_1664024830506.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_timit_demo_colab66", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_timit_demo_colab66", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab66|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|
---
layout: model
title: Translate English to Tetela Pipeline
author: John Snow Labs
name: translate_en_tll
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, tll, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `tll`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_tll_xx_2.7.0_2.4_1609699282699.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_tll_xx_2.7.0_2.4_1609699282699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_tll", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_tll", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.tll').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_tll|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from ksabeh)
author: John Snow Labs
name: bert_qa_base_uncased_attribute_correction_mlm_titles
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-attribute-correction-mlm-titles` is a English model originally trained by `ksabeh`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_attribute_correction_mlm_titles_en_4.0.0_3.0_1657183812655.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_attribute_correction_mlm_titles_en_4.0.0_3.0_1657183812655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_attribute_correction_mlm_titles","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_attribute_correction_mlm_titles","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_attribute_correction_mlm_titles|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/ksabeh/bert-base-uncased-attribute-correction-mlm-titles
---
layout: model
title: Lewotobi RobertaForQuestionAnswering (from 21iridescent)
author: John Snow Labs
name: roberta_qa_distilroberta_base_finetuned_squad2_lwt
date: 2022-06-20
tags: [open_source, question_answering, roberta]
task: Question Answering
language: lwt
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base-finetuned-squad2-lwt` is a Lewotobi model originally trained by `21iridescent`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_distilroberta_base_finetuned_squad2_lwt_lwt_4.0.0_3.0_1655728304909.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_distilroberta_base_finetuned_squad2_lwt_lwt_4.0.0_3.0_1655728304909.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_distilroberta_base_finetuned_squad2_lwt","lwt") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_distilroberta_base_finetuned_squad2_lwt","lwt")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("lwt.answer_question.squadv2.roberta.distilled_base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_distilroberta_base_finetuned_squad2_lwt|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|lwt|
|Size:|307.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/21iridescent/distilroberta-base-finetuned-squad2-lwt
---
layout: model
title: French BertForMaskedLM Base Cased model (from Geotrend)
author: John Snow Labs
name: bert_embeddings_base_fr_cased
date: 2022-12-02
tags: [fr, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: fr
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-fr-cased` is a French model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_fr_cased_fr_4.2.4_3.0_1670017613286.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_fr_cased_fr_4.2.4_3.0_1670017613286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_fr_cased","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_fr_cased","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_fr_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|fr|
|Size:|393.6 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/bert-base-fr-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Pipeline to Detect Chemicals in Medical text (BertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_ner_chemicals_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, berfortokenclassification, chemicals, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [bert_token_classifier_ner_chemicals](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_chemicals_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemicals_pipeline_en_3.4.1_3.0_1647889424974.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemicals_pipeline_en_3.4.1_3.0_1647889424974.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("bert_token_classifier_ner_chemicals_pipeline", "en", "clinical/models")
pipeline.annotate("The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.")
```
```scala
val pipeline = new PretrainedPipeline("bert_token_classifier_ner_chemicals_pipeline", "en", "clinical/models")
pipeline.annotate("The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.chemicals_pipeline").predict("""The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.""")
```
## Results
```bash
+---------------------------+---------+
|chunk |ner_label|
+---------------------------+---------+
|p - choloroaniline |CHEM |
|chlorhexidine - digluconate|CHEM |
|kanamycin |CHEM |
|colistin |CHEM |
|povidone - iodine |CHEM |
+---------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_chemicals_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.7 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverter
---
layout: model
title: English image_classifier_vit_vision_transformer_v3 ViTForImageClassification from mrgiraffe
author: John Snow Labs
name: image_classifier_vit_vision_transformer_v3
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_vision_transformer_v3` is an English model originally trained by mrgiraffe.
## Predicted Entities
`chart`, `imagechart`, `notchart`, `pdfpagechart`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vision_transformer_v3_en_4.1.0_3.0_1660168553159.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vision_transformer_v3_en_4.1.0_3.0_1660168553159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_vision_transformer_v3", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_vision_transformer_v3", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_vision_transformer_v3|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Hindi XlmRoBertaForQuestionAnswering (from bhavikardeshna)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_base_hindi
date: 2022-06-23
tags: [hi, open_source, question_answering, xlmroberta]
task: Question Answering
language: hi
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-hindi` is a Hindi model originally trained by `bhavikardeshna`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_hindi_hi_4.0.0_3.0_1655990042129.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_hindi_hi_4.0.0_3.0_1655990042129.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_hindi","hi") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlm_roberta_base_hindi","hi")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("hi.answer_question.xlm_roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_base_hindi|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|hi|
|Size:|885.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/bhavikardeshna/xlm-roberta-base-hindi
---
layout: model
title: Finnish asr_wav2vec2_xlsr_train_aug_bigLM_1B TFWav2Vec2ForCTC from RASMUS
author: John Snow Labs
name: asr_wav2vec2_xlsr_train_aug_bigLM_1B
date: 2022-09-25
tags: [wav2vec2, fi, audio, open_source, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_train_aug_bigLM_1B` is a Finnish model originally trained by RASMUS.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xlsr_train_aug_bigLM_1B_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_train_aug_bigLM_1B_fi_4.2.0_3.0_1664097486875.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_train_aug_bigLM_1B_fi_4.2.0_3.0_1664097486875.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_xlsr_train_aug_bigLM_1B", "fi")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_xlsr_train_aug_bigLM_1B", "fi")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_xlsr_train_aug_bigLM_1B|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|fi|
|Size:|3.6 GB|
---
layout: model
title: German asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779 TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779
date: 2022-09-26
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779` is a German model originally trained by jonatasgrosman.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779_de_4.2.0_3.0_1664191745283.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779_de_4.2.0_3.0_1664191745283.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779", "de")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779", "de")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|de|
|Size:|1.2 GB|
---
layout: model
title: Legal Enforceability Clause Binary Classifier
author: John Snow Labs
name: legclf_enforceability_clause
date: 2022-09-28
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `enforceability` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in the Models Hub, producing a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `enforceability`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_enforceability_clause_en_1.0.0_3.0_1664363141173.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_enforceability_clause_en_1.0.0_3.0_1664363141173.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+----------------+
|          result|
+----------------+
|[enforceability]|
|         [other]|
|         [other]|
|[enforceability]|
+----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_enforceability_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
enforceability 0.87 0.89 0.88 38
other 0.95 0.94 0.94 78
accuracy - - 0.92 116
macro-avg 0.91 0.92 0.91 116
weighted-avg 0.92 0.92 0.92 116
```
---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_512
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Modified-BlueBERT-512` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities
`Chemical`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_512_en_4.0.0_3.0_1657108398297.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_512_en_4.0.0_3.0_1657108398297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_512","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_512","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_512|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|408.7 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Modified-BlueBERT-512
---
layout: model
title: English BertForQuestionAnswering model (from rahulkuruvilla)
author: John Snow Labs
name: bert_qa_COVID_BERTa
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `COVID-BERTa` is an English model originally trained by `rahulkuruvilla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_COVID_BERTa_en_4.0.0_3.0_1654176515744.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_COVID_BERTa_en_4.0.0_3.0_1654176515744.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_COVID_BERTa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_COVID_BERTa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.covid_bert.a.by_rahulkuruvilla").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_COVID_BERTa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/rahulkuruvilla/COVID-BERTa
---
layout: model
title: German asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman` is a German model originally trained by jonatasgrosman.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman_de_4.2.0_3.0_1664096062320.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman_de_4.2.0_3.0_1664096062320.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman', lang = 'de')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman", lang = "de")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|de|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Translate English to Dravidian languages Pipeline
author: John Snow Labs
name: translate_en_dra
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, dra, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `dra`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_dra_xx_2.7.0_2.4_1609698809869.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_dra_xx_2.7.0_2.4_1609698809869.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_dra", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_dra", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.dra').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_dra|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Hindi Bert Embeddings (from Geotrend)
author: John Snow Labs
name: bert_embeddings_bert_base_hi_cased
date: 2022-04-11
tags: [bert, embeddings, hi, open_source]
task: Embeddings
language: hi
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-hi-cased` is a Hindi model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_hi_cased_hi_3.4.2_3.0_1649673139297.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_hi_cased_hi_3.4.2_3.0_1649673139297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_hi_cased","hi") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["मुझे स्पार्क एनएलपी पसंद है"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_hi_cased","hi")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("मुझे स्पार्क एनएलपी पसंद है").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("hi.embed.bert_hi_cased").predict("""मुझे स्पार्क एनएलपी पसंद है""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_hi_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|hi|
|Size:|339.5 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/bert-base-hi-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Mapping HCPCS Codes with Corresponding National Drug Codes (NDC) and Drug Brand Names
author: John Snow Labs
name: hcpcs_ndc_mapper
date: 2023-04-13
tags: [en, licensed, chunk_mapping, hcpcs, ndc, brand_name]
task: Chunk Mapping
language: en
edition: Healthcare NLP 4.4.0
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained model maps HCPCS codes with their corresponding National Drug Codes (NDC) and their drug brand names.
## Predicted Entities
`ndc_code`, `brand_name`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/hcpcs_ndc_mapper_en_4.4.0_3.0_1681405950608.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/hcpcs_ndc_mapper_en_4.4.0_3.0_1681405950608.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("hcpcs_chunk")
chunkerMapper = DocMapperModel.pretrained("hcpcs_ndc_mapper", "en", "clinical/models")\
.setInputCols(["hcpcs_chunk"])\
.setOutputCol("mappings")\
.setRels(["ndc_code", "brand_name"])
pipeline = Pipeline().setStages([document_assembler,
chunkerMapper])
model = pipeline.fit(spark.createDataFrame([['']]).toDF('text'))
lp = LightPipeline(model)
res = lp.fullAnnotate(["Q5106", "J9211", "J7508"])
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("hcpcs_chunk")
val chunkerMapper = DocMapperModel
.pretrained("hcpcs_ndc_mapper", "en", "clinical/models")
.setInputCols(Array("hcpcs_chunk"))
.setOutputCol("mappings")
.setRels(Array("ndc_code", "brand_name"))
val mapper_pipeline = new Pipeline().setStages(Array(
document_assembler,
chunkerMapper))
val data = Seq("Q5106", "J9211", "J7508").toDS.toDF("text")
val result = mapper_pipeline.fit(data).transform(data)
```
## Results
```bash
+-----------+-------------------------------------+----------+
|hcpcs_chunk|mappings |relation |
+-----------+-------------------------------------+----------+
|Q5106 |59353-0003-10 |ndc_code |
|Q5106 |RETACRIT (PF) 3000 U/1 ML |brand_name|
|J9211 |59762-2596-01 |ndc_code |
|J9211 |IDARUBICIN HYDROCHLORIDE (PF) 1 MG/ML|brand_name|
|J7508 |00469-0687-73 |ndc_code |
|J7508 |ASTAGRAF XL 5 MG |brand_name|
+-----------+-------------------------------------+----------+
```
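The mapper output above is effectively a many-valued lookup from HCPCS code to NDC code and brand name. As a plain-Python illustration (not the model itself), the relation shown in the results table can be sketched as:

```python
# Illustrative stand-in for the hcpcs_ndc_mapper relation, using only the
# example values from the results table above. The real mappings are resolved
# internally by the pretrained model.
HCPCS_TO_NDC = {
    "Q5106": {"ndc_code": "59353-0003-10", "brand_name": "RETACRIT (PF) 3000 U/1 ML"},
    "J9211": {"ndc_code": "59762-2596-01", "brand_name": "IDARUBICIN HYDROCHLORIDE (PF) 1 MG/ML"},
    "J7508": {"ndc_code": "00469-0687-73", "brand_name": "ASTAGRAF XL 5 MG"},
}

def map_hcpcs(code: str, relation: str) -> str:
    """Look up one relation ('ndc_code' or 'brand_name') for an HCPCS code."""
    return HCPCS_TO_NDC[code][relation]

print(map_hcpcs("Q5106", "ndc_code"))    # 59353-0003-10
print(map_hcpcs("J7508", "brand_name"))  # ASTAGRAF XL 5 MG
```

The `.setRels([...])` call in the pipeline above selects which of these relations the model emits.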
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|hcpcs_ndc_mapper|
|Compatibility:|Healthcare NLP 4.4.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|20.7 KB|
---
layout: model
title: German asr_wav2vec2_large_xlsr_53_german_by_oliverguhr TFWav2Vec2ForCTC from oliverguhr
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_53_german_by_oliverguhr
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_german_by_oliverguhr` is a German model originally trained by oliverguhr.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_german_by_oliverguhr_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_by_oliverguhr_de_4.2.0_3.0_1664104534440.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_by_oliverguhr_de_4.2.0_3.0_1664104534440.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_german_by_oliverguhr', lang = 'de')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_german_by_oliverguhr", lang = "de")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_german_by_oliverguhr|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|de|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Legal Information Technology And Data Processing Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_information_technology_and_data_processing_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, information_technology_and_data_processing, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
The `legclf_information_technology_and_data_processing_bert` model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Information_Technology_and_Data_Processing or not (binary classification), according to EuroVoc labels.
## Predicted Entities
`Information_Technology_and_Data_Processing`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_information_technology_and_data_processing_bert_en_1.0.0_3.0_1678111859916.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_information_technology_and_data_processing_bert_en_1.0.0_3.0_1678111859916.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+--------------------------------------------+
|result                                      |
+--------------------------------------------+
|[Information_Technology_and_Data_Processing]|
|[Other]                                     |
|[Other]                                     |
|[Information_Technology_and_Data_Processing]|
+--------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_information_technology_and_data_processing_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.7 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Information_Technology_and_Data_Processing 0.85 0.80 0.82 153
Other 0.79 0.85 0.82 141
accuracy - - 0.82 294
macro-avg 0.82 0.82 0.82 294
weighted-avg 0.83 0.82 0.82 294
```
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18)
author: John Snow Labs
name: distilbert_qa_base_uncased_becas_7
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becas-7` is an English model originally trained by `Evelyn18`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_7_en_4.3.0_3.0_1672767624756.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_7_en_4.3.0_3.0_1672767624756.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_7","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_7","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_becas_7|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Evelyn18/distilbert-base-uncased-becas-7
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from clisi2000)
author: John Snow Labs
name: xlmroberta_ner_clisi2000_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `clisi2000`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_clisi2000_base_finetuned_panx_de_4.1.0_3.0_1660431788167.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_clisi2000_base_finetuned_panx_de_4.1.0_3.0_1660431788167.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_clisi2000_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_clisi2000_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_clisi2000_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/clisi2000/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: English DistilBertForQuestionAnswering model (from minhdang241)
author: John Snow Labs
name: distilbert_qa_robustqa_baseline_01
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `robustqa-baseline-01` is an English model originally trained by `minhdang241`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_robustqa_baseline_01_en_4.0.0_3.0_1654728524154.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_robustqa_baseline_01_en_4.0.0_3.0_1654728524154.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_robustqa_baseline_01","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_robustqa_baseline_01","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.base.by_minhdang241").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_robustqa_baseline_01|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/minhdang241/robustqa-baseline-01
---
layout: model
title: French CamemBert Embeddings (from yancong)
author: John Snow Labs
name: camembert_embeddings_yancong_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `yancong`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_yancong_generic_model_fr_3.4.4_3.0_1653990825081.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_yancong_generic_model_fr_3.4.4_3.0_1653990825081.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_yancong_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_yancong_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_yancong_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/yancong/dummy-model
---
layout: model
title: Chinese BertForMaskedLM Cased model (from qinluo)
author: John Snow Labs
name: bert_embeddings_wo_chinese_plus
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `wobert-chinese-plus` is a Chinese model originally trained by `qinluo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_wo_chinese_plus_zh_4.2.4_3.0_1670023089360.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_wo_chinese_plus_zh_4.2.4_3.0_1670023089360.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_wo_chinese_plus","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_wo_chinese_plus","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_wo_chinese_plus|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|467.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/qinluo/wobert-chinese-plus
- https://github.com/ZhuiyiTechnology/WoBERT
- https://github.com/JunnYu/WoBERT_pytorch
---
layout: model
title: English DistilBertForQuestionAnswering model (from abhilash1910)
author: John Snow Labs
name: distilbert_qa_squadv1
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-squadv1` is an English model originally trained by `abhilash1910`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_squadv1_en_4.0.0_3.0_1654727758712.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_squadv1_en_4.0.0_3.0_1654727758712.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squadv1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squadv1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.by_abhilash1910").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_squadv1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/abhilash1910/distilbert-squadv1
---
layout: model
title: Sentence Entity Resolver for UMLS CUI Codes (Drug & Substance)
author: John Snow Labs
name: sbiobertresolve_umls_drug_substance
date: 2021-12-06
tags: [entity_resolution, en, clinical, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.3
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities to UMLS CUI codes. It is trained on the `2021AB` UMLS dataset. The complete dataset has 127 different categories, and this model is trained on the `Clinical Drug`, `Pharmacologic Substance`, `Antibiotic`, and `Hazardous or Poisonous Substance` categories using `sbiobert_base_cased_mli` embeddings.
## Predicted Entities
`Predicts UMLS codes for Drugs & Substances medical concepts`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_drug_substance_en_3.3.3_3.0_1638802613409.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_drug_substance_en_3.3.3_3.0_1638802613409.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
---
layout: model
title: Legal Purchase price Clause Binary Classifier
author: John Snow Labs
name: legclf_purchase_price_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `purchase-price` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences rather than the whole text, so it's better to skip it unless you want to do binary classification at the sentence level.
If you have big legal documents, and you want to look for clauses, we recommend you to split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
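The stacking described above can be sketched in plain Python. Here `classify_purchase_price` and `classify_indemnification` are hypothetical toy stand-ins for Spark NLP clause classifiers; the point is only that running several binary classifiers over the same paragraph yields one True/False flag per clause type:

```python
# Hypothetical sketch: each function stands in for one binary clause classifier
# (e.g. legclf_purchase_price_clause). The real classifiers are Spark NLP models.
def classify_purchase_price(text: str) -> bool:
    # Toy heuristic, for illustration only.
    return "purchase price" in text.lower()

def classify_indemnification(text: str) -> bool:
    # Toy heuristic, for illustration only.
    return "indemnif" in text.lower()

CLAUSE_CLASSIFIERS = {
    "purchase-price": classify_purchase_price,
    "indemnification": classify_indemnification,
}

def classify_clauses(paragraph: str) -> dict:
    """Run every binary classifier on one paragraph: one True/False per clause type."""
    return {name: clf(paragraph) for name, clf in CLAUSE_CLASSIFIERS.items()}

print(classify_clauses("The Purchase Price shall be paid at Closing."))
# {'purchase-price': True, 'indemnification': False}
```

In a real pipeline each entry would instead be a `legclf_*` model added as a stage.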
## Predicted Entities
`other`, `purchase-price`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_purchase_price_clause_en_1.0.0_3.2_1660123871318.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_purchase_price_clause_en_1.0.0_3.2_1660123871318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+----------------+
|result          |
+----------------+
|[purchase-price]|
|[other]         |
|[other]         |
|[purchase-price]|
+----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_purchase_price_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.99 0.99 0.99 81
purchase-price 0.98 0.98 0.98 47
accuracy - - 0.98 128
macro-avg 0.98 0.98 0.98 128
weighted-avg 0.98 0.98 0.98 128
```
---
layout: model
title: Pipeline to Detect PHI for deidentification purposes
author: John Snow Labs
name: ner_deid_subentity_augmented_i2b2_pipeline
date: 2023-03-13
tags: [deid, ner, phi, deidentification, licensed, i2b2, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_deid_subentity_augmented_i2b2](https://nlp.johnsnowlabs.com/2021/11/29/ner_deid_subentity_augmented_i2b2_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_i2b2_pipeline_en_4.3.0_3.2_1678735152629.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_i2b2_pipeline_en_4.3.0_3.2_1678735152629.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_deid_subentity_augmented_i2b2_pipeline", "en", "clinical/models")
text = '''Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_deid_subentity_augmented_i2b2_pipeline", "en", "clinical/models")
val text = "Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.deid.subentity_ner_augmented_i2b2.pipeline").predict("""Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.""")
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:------------------------------|--------:|------:|:--------------|-------------:|
| 0 | 2093-01-13 | 14 | 23 | DATE | 0.9997 |
| 1 | David Hale | 27 | 36 | DOCTOR | 0.9507 |
| 2 | Hendrickson Ora | 55 | 69 | PATIENT | 0.9981 |
| 3 | 7194334 | 78 | 84 | MEDICALRECORD | 0.9996 |
| 4 | 01/13/93 | 93 | 100 | DATE | 0.9992 |
| 5 | Oliveira | 110 | 117 | DOCTOR | 0.8822 |
| 6 | 25 | 121 | 122 | AGE | 0.5648 |
| 7 | 2079-11-09 | 150 | 159 | DATE | 0.9995 |
| 8 | Cocke County Baptist Hospital | 163 | 191 | HOSPITAL | 0.863775 |
| 9 | 0295 Keats Street | 195 | 211 | STREET | 0.754533 |
| 10 | 302-786-5227 | 221 | 232 | PHONE | 0.9697 |
```
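The `begin` and `end` values in the table above are inclusive character offsets into the input text, so a chunk can be recovered with `text[begin:end + 1]`. A quick plain-Python check, reusing the sample text from the usage snippet:

```python
# Sample text from the usage example above.
text = ("Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , "
        "MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , "
        "Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.")

# (chunk, begin, end) triples copied from the results table; end is
# inclusive, so Python slicing needs end + 1.
rows = [
    ("2093-01-13", 14, 23),
    ("David Hale", 27, 36),
    ("Hendrickson Ora", 55, 69),
    ("Cocke County Baptist Hospital", 163, 191),
    ("302-786-5227", 221, 232),
]
for chunk, begin, end in rows:
    assert text[begin:end + 1] == chunk
print("all offsets match")  # prints: all offsets match
```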
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_subentity_augmented_i2b2_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Chinese Bert Embeddings (Base)
author: John Snow Labs
name: bert_embeddings_bert_base_chinese_jinyong
date: 2022-04-11
tags: [bert, embeddings, zh, open_source]
task: Embeddings
language: zh
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-chinese-jinyong` is a Chinese model originally trained by `yechen`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_chinese_jinyong_zh_3.4.2_3.0_1649670833638.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_chinese_jinyong_zh_3.4.2_3.0_1649670833638.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_chinese_jinyong","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_chinese_jinyong","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.embed.bert_base_chinese_jinyong").predict("""I love Spark NLP""")
```
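Downstream tasks typically compare the token vectors in the `embeddings` column with cosine similarity. A minimal plain-Python sketch with toy 3-dimensional vectors standing in for the model's output (not tied to Spark NLP's API):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for two token embeddings.
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # ~1.0 (same direction)
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)
```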
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_chinese_jinyong|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|384.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/yechen/bert-base-chinese-jinyong
---
layout: model
title: Detect Anatomical Regions
author: John Snow Labs
name: ner_anatomy_en
date: 2020-04-22
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 2.4.2
spark_version: 2.4
tags: [ner, en, clinical, licensed]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for anatomy terms. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
## Predicted Entities
`Anatomical_system`, `Cell`, `Cellular_component`, `Developing_anatomical_structure`, `Immaterial_anatomical_entity`, `Multi-tissue_structure`, `Organ`, `Organism_subdivision`, `Organism_substance`, `Pathological_formation`, `Tissue`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ANATOMY/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_en_2.4.2_2.4_1587513307751.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_en_2.4.2_2.4_1587513307751.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %}
```python
...
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_anatomy", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
results = model.transform(spark.createDataFrame([['This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.']], ["text"]))
```
```scala
...
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = NerDLModel.pretrained("ner_anatomy", "en", "clinical/models")
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))
val data = Seq("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.
General: Well-developed female, in no acute distress, afebrile.
HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe or add the ``"Finisher"`` to the end of your pipeline.
```bash
+-------------------+----------------------+
|chunk |ner |
+-------------------+----------------------+
|skin |Organ |
|Extraocular muscles|Organ |
|turbinates |Multi-tissue_structure|
|Mucous membranes |Tissue |
|Neck |Organism_subdivision |
|bowel |Organ |
|skin |Organ |
+-------------------+----------------------+
```
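The chunk column above comes from collapsing per-token IOB tags (`B-X`, `I-X`, `O`) into spans. NerConverter does this internally; a simplified plain-Python illustration (not the annotator's actual code):

```python
def iob_to_chunks(tokens, tags):
    """Collapse IOB token tags ('B-X'/'I-X'/'O') into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity starts here
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)         # continue the open entity
        else:                             # 'O' tag closes any open entity
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Extraocular", "muscles", "intact", "."]
tags = ["B-Organ", "I-Organ", "O", "O"]
print(iob_to_chunks(tokens, tags))  # [('Extraocular muscles', 'Organ')]
```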
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_anatomy_en_2.4.2_2.4|
|Type:|ner|
|Compatibility:|Spark NLP 2.4.2|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Case sensitive:|false|
{:.h2_title}
## Data Source
Trained on the Anatomical Entity Mention (AnEM) corpus with ``'embeddings_clinical'``.
http://www.nactem.ac.uk/anatomy/
{:.h2_title}
## Benchmarking
```bash
| | label | tp | fp | fn | prec | rec | f1 |
|---:|----------------------------------:|-----:|-----:|-----:|---------:|---------:|---------:|
| 0 | B-Immaterial_anatomical_entity | 4 | 0 | 1 | 1 | 0.8 | 0.888889 |
| 1 | B-Cellular_component | 14 | 4 | 7 | 0.777778 | 0.666667 | 0.717949 |
| 2 | B-Organism_subdivision | 21 | 7 | 3 | 0.75 | 0.875 | 0.807692 |
| 3 | I-Cell | 47 | 8 | 5 | 0.854545 | 0.903846 | 0.878505 |
| 4 | B-Tissue | 14 | 2 | 10 | 0.875 | 0.583333 | 0.7 |
| 5 | B-Anatomical_system | 5 | 1 | 3 | 0.833333 | 0.625 | 0.714286 |
| 6 | B-Organism_substance | 26 | 2 | 8 | 0.928571 | 0.764706 | 0.83871 |
| 7 | B-Cell | 86 | 6 | 11 | 0.934783 | 0.886598 | 0.910053 |
| 8 | I-Immaterial_anatomical_entity | 5 | 0 | 0 | 1 | 1 | 1 |
| 9 | I-Tissue | 16 | 1 | 6 | 0.941176 | 0.727273 | 0.820513 |
| 10 | I-Pathological_formation | 20 | 0 | 1 | 1 | 0.952381 | 0.97561 |
| 11 | I-Anatomical_system | 7 | 0 | 0 | 1 | 1 | 1 |
| 12 | B-Organ | 30 | 7 | 3 | 0.810811 | 0.909091 | 0.857143 |
| 13 | B-Pathological_formation | 35 | 5 | 3 | 0.875 | 0.921053 | 0.897436 |
| 14 | I-Cellular_component | 4 | 0 | 3 | 1 | 0.571429 | 0.727273 |
| 15 | I-Multi-tissue_structure | 26 | 10 | 6 | 0.722222 | 0.8125 | 0.764706 |
| 16 | B-Multi-tissue_structure | 57 | 23 | 8 | 0.7125 | 0.876923 | 0.786207 |
| 17 | I-Organism_substance | 6 | 2 | 0 | 0.75 | 1 | 0.857143 |
| 18 | Macro-average | 424 | 84 | 88 | 0.731775 | 0.682666 | 0.706368 |
| 19 | Micro-average | 424 | 84 | 88 | 0.834646 | 0.828125 | 0.831372 |
```
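As a sanity check, the micro-average row pools the raw tp/fp/fn counts across all labels before computing the scores. A quick plain-Python verification against the totals above:

```python
# Totals from the "Micro-average" row of the table above.
tp, fp, fn = 424, 84, 88

precision = tp / (tp + fp)   # 424 / 508
recall = tp / (tp + fn)      # 424 / 512
f1 = 2 * precision * recall / (precision + recall)

# Reproduces the reported micro-average scores
# (precision 0.834646, recall 0.828125, f1 ~0.831372).
assert abs(precision - 0.834646) < 1e-6
assert abs(recall - 0.828125) < 1e-12
assert abs(f1 - 0.831372) < 1e-5
```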
---
layout: model
title: Clinical Deidentification Pipeline (Portuguese)
author: John Snow Labs
name: clinical_deidentification
date: 2022-06-21
tags: [deid, deidentification, pt, licensed]
task: [De-identification, Pipeline Healthcare]
language: pt
edition: Healthcare NLP 3.5.0
spark_version: 3.0
supported: true
recommended: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline is trained with `w2v_cc_300d` Portuguese embeddings and can be used to deidentify PHI from medical texts in Portuguese. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask, fake, or obfuscate the following entities: `AGE`, `DATE`, `PROFESSION`, `EMAIL`, `ID`, `COUNTRY`, `STREET`, `DOCTOR`, `HOSPITAL`, `PATIENT`, `URL`, `IP`, `ORGANIZATION`, `PHONE`, `ZIP`, `ACCOUNT`, `SSN`, `PLATE`, `SEX` and `IPADDR`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_pt_3.5.0_3.0_1655820388743.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_pt_3.5.0_3.0_1655820388743.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("clinical_deidentification", "pt", "clinical/models")
sample = """Dados do paciente.
Nome: Mauro.
Apelido: Gonçalves.
NIF: 368503.
NISS: 26 63514095.
Endereço: Calle Miguel Benitez 90.
CÓDIGO POSTAL: 28016.
Dados de cuidados.
Data de nascimento: 03/03/1946.
País: Portugal.
Idade: 70 anos Sexo: M.
Data de admissão: 12/12/2016.
Doutor: Ignacio Navarro Cuéllar NºCol: 28 28 70973.
Relatório clínico do paciente: Paciente de 70 anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia.
Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal.
O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV.
A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal.
Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicéridos de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml.
A citologia da urina era repetidamente desconfiada por malignidade.
A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis.
A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso.
O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal.
A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga.
Referido por: Miguel Santos - Avenida dos Aliados, 22 Portugal E-mail: nnavcu@hotmail.com.
"""
result = deid_pipeline.annotate(sample)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "pt", "clinical/models")
val sample = """Dados do paciente.
Nome: Mauro.
Apelido: Gonçalves.
NIF: 368503.
NISS: 26 63514095.
Endereço: Calle Miguel Benitez 90.
CÓDIGO POSTAL: 28016.
Dados de cuidados.
Data de nascimento: 03/03/1946.
País: Portugal.
Idade: 70 anos Sexo: M.
Data de admissão: 12/12/2016.
Doutor: Ignacio Navarro Cuéllar NºCol: 28 28 70973.
Relatório clínico do paciente: Paciente de 70 anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia.
Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal.
O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV.
A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal.
Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicéridos de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml.
A citologia da urina era repetidamente desconfiada por malignidade.
A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis.
A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso.
O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal.
A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga.
Referido por: Miguel Santos - Avenida dos Aliados, 22 Portugal E-mail: nnavcu@hotmail.com"""
val result = deid_pipeline.annotate(sample)
```
{:.nlu-block}
```python
import nlu
nlu.load("pt.deid.clinical").predict("""Dados do paciente.
Nome: Mauro.
Apelido: Gonçalves.
NIF: 368503.
NISS: 26 63514095.
Endereço: Calle Miguel Benitez 90.
CÓDIGO POSTAL: 28016.
Dados de cuidados.
Data de nascimento: 03/03/1946.
País: Portugal.
Idade: 70 anos Sexo: M.
Data de admissão: 12/12/2016.
Doutor: Ignacio Navarro Cuéllar NºCol: 28 28 70973.
Relatório clínico do paciente: Paciente de 70 anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia.
Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal.
O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV.
A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal.
Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicéridos de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml.
A citologia da urina era repetidamente desconfiada por malignidade.
A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis.
A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso.
O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal.
A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga.
Referido por: Miguel Santos - Avenida dos Aliados, 22 Portugal E-mail: nnavcu@hotmail.com.
""")
```
## Results
```bash
Masked with entity labels
------------------------------
Dados do .
Nome: .
Apelido: .
NIF: .
NISS: .
Endereço: .
CÓDIGO POSTAL: .
Dados de cuidados.
Data de nascimento: .
País: .
Idade: anos Sexo: .
Data de admissão: .
Doutor: Cuéllar NºCol: .
Relatório clínico do : de anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda;
Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia.
Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal.
O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV.
A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal.
Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicér de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml.
A citologia da urina era repetidamente desconfiada por malignidade.
A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis.
A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso.
O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress.
A tomografia computorizada abdominal é normal.
A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga.
Referido por: - , 22 E-mail: .
Masked with chars
------------------------------
Dados do [******].
Nome: [***].
Apelido: [*******].
NIF: [****].
NISS: [*********].
Endereço: [*********************].
CÓDIGO POSTAL: [***].
Dados de cuidados.
Data de nascimento: [********].
País: [******].
Idade: ** anos Sexo: *.
Data de admissão: [********].
Doutor: [*************] Cuéllar NºCol: ** ** [***].
Relatório clínico do [******]: [******] de ** anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda;
Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia.
Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal.
O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV.
A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal.
Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicér[**] de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml.
A citologia da urina era repetidamente desconfiada por malignidade.
A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis.
A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso.
O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress.
A tomografia computorizada abdominal é normal.
A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga.
Referido por: [***********] - [*****************], 22 [******] E-mail: [****************].
Masked with fixed length chars
------------------------------
Dados do ****.
Nome: ****.
Apelido: ****.
NIF: ****.
NISS: ****.
Endereço: ****.
CÓDIGO POSTAL: ****.
Dados de cuidados.
Data de nascimento: ****.
País: ****.
Idade: **** anos Sexo: ****.
Data de admissão: ****.
Doutor: **** Cuéllar NºCol: **** **** ****.
Relatório clínico do ****: **** de **** anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda;
Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia.
Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal.
O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV.
A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal.
Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicér**** de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml.
A citologia da urina era repetidamente desconfiada por malignidade.
A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis.
A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso.
O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress.
A tomografia computorizada abdominal é normal.
A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga.
Referido por: **** - ****, 22 **** E-mail: ****.
Obfuscated
------------------------------
Dados do H..
Nome: Marcos Alves.
Apelido: Tiago Santos.
NIF: 566-445.
NISS: 134544332.
Endereço: Rua de Santa María, 100.
CÓDIGO POSTAL: 4099.
Dados de cuidados.
Data de nascimento: 31/03/1946.
País: Espanha.
Idade: 46 anos Sexo: Mulher.
Data de admissão: 06/01/2017.
Doutor: Carlos Melo Cuéllar NºCol: 134544332 134544332 124 445 311.
Relatório clínico do H.: M. de 46 anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda;
Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia.
Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal.
O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV.
A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal.
Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicérHomen de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml.
A citologia da urina era repetidamente desconfiada por malignidade.
A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis.
A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso.
O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress.
A tomografia computorizada abdominal é normal.
A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga.
Referido por: Carlos Melo - Avenida Dos Aliados, 56, 22 Espanha E-mail: maria.prado@jacob.com.
```
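The second and third outputs above follow two simple character-masking policies: same-length masking keeps each entity's width (brackets plus asterisks), while fixed-length masking always emits four asterisks. A plain-Python sketch of the idea (illustrative helpers, not the pipeline's internal `DeIdentificationModel` implementation):

```python
def mask_same_length(entity: str) -> str:
    """'Masked with chars' policy: preserve the entity's character width."""
    return "[" + "*" * (len(entity) - 2) + "]"

def mask_fixed_length(entity: str) -> str:
    """'Masked with fixed length chars' policy: always four asterisks."""
    return "****"

print(mask_same_length("Mauro"))       # [***]  (same 5-character width)
print(mask_same_length("Gonçalves"))   # [*******]
print(mask_fixed_length("Mauro"))      # ****
```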
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|clinical_deidentification|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.5.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|pt|
|Size:|1.3 GB|
## Included Models
- nlp.DocumentAssembler
- nlp.SentenceDetectorDLModel
- nlp.TokenizerModel
- nlp.WordEmbeddingsModel
- medical.NerModel
- nlp.NerConverter
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- nlp.TextMatcherModel
- ContextualParserModel
- ContextualParserModel
- nlp.RegexMatcherModel
- nlp.RegexMatcherModel
- ChunkMergeModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- Finisher
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77 TFWav2Vec2ForCTC from emeson77
author: John Snow Labs
name: asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77` is an English model originally trained by emeson77.
NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77_en_4.2.0_3.0_1664037180334.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77_en_4.2.0_3.0_1664037180334.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
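Under the hood, a Wav2Vec2ForCTC model emits one token prediction per audio frame, and greedy CTC decoding collapses consecutive repeats and removes the blank symbol to produce the transcript. A minimal, framework-independent sketch of that collapse step (the token ids and vocabulary below are hypothetical, not the model's real vocabulary):

```python
# Greedy CTC collapse: merge consecutive repeats, then drop blanks.
BLANK = 0  # id reserved for the CTC blank symbol (illustrative)

def ctc_greedy_decode(frame_ids, id2char):
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != BLANK:  # new, non-blank token starts here
            out.append(id2char[t])
        prev = t
    return "".join(out)

vocab = {1: "h", 2: "i"}
# frames: h h <blank> h i i  ->  "hhi" (the blank separates the repeated h's)
print(ctc_greedy_decode([1, 1, 0, 1, 2, 2], vocab))
```

The Spark NLP annotator performs this decoding internally; the sketch only illustrates why CTC models can output the same character twice in a row.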
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Legal Food Technology Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_food_technology_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, food_technology, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
The `legclf_food_technology_bert` model is a BERT Sentence Embeddings Document Classifier that, given a document, determines whether it belongs to the Food_Technology class or not (binary classification) according to the EuroVoc labels.
## Predicted Entities
`Food_Technology`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_food_technology_bert_en_1.0.0_3.0_1678111847548.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_food_technology_bert_en_1.0.0_3.0_1678111847548.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-----------------+
|result           |
+-----------------+
|[Food_Technology]|
|[Other]          |
|[Other]          |
|[Food_Technology]|
+-----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_food_technology_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.8 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Food_Technology 0.87 0.86 0.86 221
Other 0.83 0.84 0.84 181
accuracy - - 0.85 402
macro-avg 0.85 0.85 0.85 402
weighted-avg 0.85 0.85 0.85 402
```
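The macro and weighted averages in the table above follow directly from the per-class rows: the macro average is the unweighted mean over classes, while the weighted average weights each class by its support. A quick sketch recomputing the precision averages from the figures in the table:

```python
# Recompute macro and weighted precision averages from the per-class rows.
classes = {
    "Food_Technology": {"precision": 0.87, "recall": 0.86, "f1": 0.86, "support": 221},
    "Other":           {"precision": 0.83, "recall": 0.84, "f1": 0.84, "support": 181},
}
total = sum(c["support"] for c in classes.values())  # 402 documents

# Macro: plain mean over classes; weighted: mean weighted by support.
macro_p = sum(c["precision"] for c in classes.values()) / len(classes)
weighted_p = sum(c["precision"] * c["support"] for c in classes.values()) / total

print(round(macro_p, 2), round(weighted_p, 2))  # 0.85 0.85
```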
---
layout: model
title: Vaccine Sentiment Classifier (BioBERT)
author: John Snow Labs
name: bert_sequence_classifier_vaccine_sentiment
date: 2022-07-28
tags: [public_health, vaccine_sentiment, en, licensed, sequence_classification]
task: Sentiment Analysis
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
recommended: true
annotator: MedicalBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a [BioBERT](https://nlp.johnsnowlabs.com/2022/07/18/biobert_pubmed_base_cased_v1.2_en_3_0.html)-based sentiment analysis model that can extract information from COVID-19 vaccine-related tweets. The model predicts whether a tweet contains a positive, negative, or neutral sentiment about COVID-19 vaccines.
## Predicted Entities
`neutral`, `positive`, `negative`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_VACCINE_STATUS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vaccine_sentiment_en_4.0.0_3.0_1658995472179.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vaccine_sentiment_en_4.0.0_3.0_1658995472179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vaccine_sentiment", "en", "clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
text_list = ['A little bright light for an otherwise dark week. Thanks researchers, and frontline workers. Onwards.',
'People with a history of severe allergic reaction to any component of the vaccine should not take.',
'43 million doses of vaccines administrated worldwide...Production capacity of CHINA to reach 4 b']
data = spark.createDataFrame(text_list, StringType()).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vaccine_sentiment", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier))
val data = Seq(Array("A little bright light for an otherwise dark week. Thanks researchers, and frontline workers. Onwards.",
"People with a history of severe allergic reaction to any component of the vaccine should not take.",
"43 million doses of vaccines administrated worldwide...Production capacity of CHINA to reach 4 b")).toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.bert_sequence_vaccine_sentiment").predict("""A little bright light for an otherwise dark week. Thanks researchers, and frontline workers. Onwards.""")
```
## Results
```bash
+-----------------------------------------------------------------------------------------------------+----------+
|text |class |
+-----------------------------------------------------------------------------------------------------+----------+
|A little bright light for an otherwise dark week. Thanks researchers, and frontline workers. Onwards.|[positive]|
|People with a history of severe allergic reaction to any component of the vaccine should not take. |[negative]|
|43 million doses of vaccines administrated worldwide...Production capacity of CHINA to reach 4 b |[neutral] |
+-----------------------------------------------------------------------------------------------------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_vaccine_sentiment|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
Curated from several academic and in-house datasets.
## Benchmarking
```bash
label precision recall f1-score support
neutral 0.82 0.78 0.80 1007
positive 0.88 0.90 0.89 1002
negative 0.83 0.86 0.84 881
accuracy - - 0.85 2890
macro-avg 0.85 0.85 0.85 2890
weighted-avg 0.85 0.85 0.85 2890
```
---
layout: model
title: English BertForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: bert_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_only_classfn_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657191247552.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657191247552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
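BertForQuestionAnswering works as a span classifier: it scores every token in the context as a possible answer start and as a possible answer end, and the best-scoring valid pair (start ≤ end) is returned as the answer. A toy sketch of that span-selection step over hypothetical logits:

```python
# Pick the best (start, end) span with start <= end from per-token logits.
def best_span(start_logits, end_logits, max_len=30):
    best, best_score = (0, 0), float("-inf")
    for s, sl in enumerate(start_logits):
        # Only consider ends at or after the start, within a max span length.
        for e in range(s, min(s + max_len, len(end_logits))):
            score = sl + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 4.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]  # hypothetical logits
end   = [0.0, 0.1, 0.0, 3.5, 0.2, 0.0, 0.0, 0.1, 0.4, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```

The annotator handles tokenization and logit extraction itself; this only illustrates the selection rule behind the `answer` output column.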
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/rule_based_only_classfn_epochs_1_shard_1_squad2.0
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from Leizhang)
author: John Snow Labs
name: xlmroberta_ner_leizhang_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `Leizhang`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_leizhang_base_finetuned_panx_de_4.1.0_3.0_1660429672416.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_leizhang_base_finetuned_panx_de_4.1.0_3.0_1660429672416.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_leizhang_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_leizhang_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
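The NerConverter stage merges the per-token IOB tags emitted in the `ner` column into entity chunks. A minimal sketch of that merge, with made-up tokens and tags:

```python
# Merge IOB tags (B-/I-/O) into (entity_text, label) chunks,
# in the spirit of what NerConverter produces in `ner_chunk`.
def iob_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity begins
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)             # continue the open entity
        else:                               # O tag (or broken I-) closes it
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Angela", "Merkel", "besucht", "Berlin"]
tags   = ["B-PER", "I-PER", "O", "B-LOC"]
print(iob_to_chunks(tokens, tags))  # [('Angela Merkel', 'PER'), ('Berlin', 'LOC')]
```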
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_leizhang_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Leizhang/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: English asr_wav2vec2_xls_r_300m_Turkish_Tr_med TFWav2Vec2ForCTC from emre
author: John Snow Labs
name: pipeline_asr_wav2vec2_xls_r_300m_Turkish_Tr_med
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_Turkish_Tr_med` is an English model originally trained by emre.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_Turkish_Tr_med_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_Turkish_Tr_med_en_4.2.0_3.0_1664037839645.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_Turkish_Tr_med_en_4.2.0_3.0_1664037839645.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_Turkish_Tr_med', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_Turkish_Tr_med", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xls_r_300m_Turkish_Tr_med|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Translate English to Salishan languages Pipeline
author: John Snow Labs
name: translate_en_sal
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, sal, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `sal`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_sal_xx_2.7.0_2.4_1609686644399.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_sal_xx_2.7.0_2.4_1609686644399.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_sal", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_sal", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.sal').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_sal|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Seniority Clause Binary Classifier
author: John Snow Labs
name: legclf_seniority_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `seniority` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do binary classification at sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that this model's embeddings allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
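The splitting advice above can be sketched in a few lines: break the document into paragraphs at blank lines, then group paragraphs into pieces that stay under a token budget. The whitespace split used here is a crude stand-in for the model's real tokenizer, so treat the budget as approximate:

```python
# Split a document into blank-line-separated paragraphs and group them
# into pieces of at most `max_tokens` (whitespace-counted) tokens each.
def split_into_pieces(text, max_tokens=512):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    pieces, current, count = [], [], 0
    for p in paragraphs:
        n = len(p.split())
        if current and count + n > max_tokens:
            pieces.append("\n\n".join(current))  # flush the full piece
            current, count = [], 0
        current.append(p)
        count += n
    if current:
        pieces.append("\n\n".join(current))
    return pieces

doc = "Clause one text here.\n\nClause two text here.\n\nClause three."
print(split_into_pieces(doc, max_tokens=6))
```

Each returned piece can then be fed to the classifier as its own document.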
## Predicted Entities
`other`, `seniority`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_seniority_clause_en_1.0.0_3.2_1660123984376.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_seniority_clause_en_1.0.0_3.2_1660123984376.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-----------+
|result     |
+-----------+
|[seniority]|
|[other]    |
|[other]    |
|[seniority]|
+-----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_seniority_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.99 0.96 0.97 91
seniority 0.90 0.97 0.94 37
accuracy - - 0.96 128
macro-avg 0.94 0.96 0.95 128
weighted-avg 0.96 0.96 0.96 128
```
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_roberta_base_squad2_24465522
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-roberta-base-squad2-24465522` is an English model originally trained by `teacookies`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465522_en_4.0.0_3.0_1655986870269.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465522_en_4.0.0_3.0_1655986870269.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465522","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465522","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.xlm_roberta.base_24465522.by_teacookies").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_autonlp_roberta_base_squad2_24465522|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|887.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/teacookies/autonlp-roberta-base-squad2-24465522
---
layout: model
title: English RobertaForQuestionAnswering (from LucasS)
author: John Snow Labs
name: roberta_qa_robertaBaseABSA
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `robertaBaseABSA` is an English model originally trained by `LucasS`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_robertaBaseABSA_en_4.0.0_3.0_1655738733590.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_robertaBaseABSA_en_4.0.0_3.0_1655738733590.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_robertaBaseABSA","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_robertaBaseABSA","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.roberta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_robertaBaseABSA|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|436.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/LucasS/robertaBaseABSA
---
layout: model
title: English BertForQuestionAnswering model (from horsbug98)
author: John Snow Labs
name: bert_qa_Part_2_mBERT_Model_E2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Part_2_mBERT_Model_E2` is an English model originally trained by `horsbug98`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Part_2_mBERT_Model_E2_en_4.0.0_3.0_1654178989285.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Part_2_mBERT_Model_E2_en_4.0.0_3.0_1654178989285.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Part_2_mBERT_Model_E2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_Part_2_mBERT_Model_E2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.tydiqa.multi_lingual_bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_Part_2_mBERT_Model_E2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/horsbug98/Part_2_mBERT_Model_E2
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from comacrae)
author: John Snow Labs
name: roberta_qa_eda_and_parav3
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-eda-and-parav3` is an English model originally trained by `comacrae`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_eda_and_parav3_en_4.3.0_3.0_1674220091398.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_eda_and_parav3_en_4.3.0_3.0_1674220091398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_eda_and_parav3","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_eda_and_parav3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_eda_and_parav3|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/comacrae/roberta-eda-and-parav3
---
layout: model
title: English image_classifier_vit_dog_food__base_patch16_224_in21k ViTForImageClassification from sasha
author: John Snow Labs
name: image_classifier_vit_dog_food__base_patch16_224_in21k
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_dog_food__base_patch16_224_in21k` is an English model originally trained by sasha.
## Predicted Entities
`dog`, `food`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dog_food__base_patch16_224_in21k_en_4.1.0_3.0_1660171837844.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dog_food__base_patch16_224_in21k_en_4.1.0_3.0_1660171837844.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_dog_food__base_patch16_224_in21k", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_dog_food__base_patch16_224_in21k", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
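The `patch16_224` part of the model name encodes its input geometry: the 224x224 input image is cut into 16x16 patches, which become the transformer's input tokens, plus one class token. The arithmetic, in plain Python:

```python
# ViT input geometry for a base_patch16_224 model.
image_size, patch_size = 224, 16
patches_per_side = image_size // patch_size   # 14 patches along each axis
num_patches = patches_per_side ** 2           # 196 patch tokens
seq_len = num_patches + 1                     # plus one [CLS] token -> 197
print(patches_per_side, num_patches, seq_len)  # 14 196 197
```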
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_dog_food__base_patch16_224_in21k|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Sentence Embeddings - sbert medium (tuned)
author: John Snow Labs
name: sbert_jsl_medium_rxnorm_uncased
date: 2022-01-03
tags: [embeddings, clinical, licensed, en]
task: Embeddings
language: en
nav_key: models
edition: Healthcare NLP 3.3.4
spark_version: 2.4
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps sentences and documents to a 512-dimensional dense vector space by applying average pooling on top of a BERT model. It is also fine-tuned on the RxNorm dataset to improve generalization over medication-related datasets.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_rxnorm_uncased_en_3.3.4_2.4_1641241051941.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_rxnorm_uncased_en_3.3.4_2.4_1641241051941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sentence_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_rxnorm_uncased", "en", "clinical/models")\
.setInputCols("sentence")\
.setOutputCol("sbert_embeddings")
```
```scala
val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_rxnorm_uncased", "en", "clinical/models")
.setInputCols("sentence")
.setOutputCol("sbert_embeddings")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed_sentence.bert_uncased.rxnorm").predict("""Put your text here.""")
```
## Results
```bash
Gives a 512-dimensional vector representation of the sentence.
```
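Sentence vectors like these are typically compared with cosine similarity for retrieval or entity resolution. A minimal illustration in plain Python, using toy 4-dimensional vectors in place of the model's 512-dimensional output (the vectors and term names are invented for the example):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors standing in for real sentence embeddings.
aspirin = [0.9, 0.1, 0.3, 0.0]
acetylsalicylic = [0.85, 0.15, 0.25, 0.05]
insulin = [0.1, 0.9, 0.0, 0.4]

# Synonymous medication mentions should score higher than unrelated ones.
print(cosine(aspirin, acetylsalicylic) > cosine(aspirin, insulin))  # True
```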
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbert_jsl_medium_rxnorm_uncased|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|153.9 MB|
|Case sensitive:|false|
---
layout: model
title: English asr_wav2vec2_large_a TFWav2Vec2ForCTC from yongjian
author: John Snow Labs
name: asr_wav2vec2_large_a
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_a` is an English model originally trained by yongjian.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_a_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_a_en_4.2.0_3.0_1664039889685.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_a_en_4.2.0_3.0_1664039889685.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_a", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_a", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_a|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_10_h_512
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-10_H-512` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_512_zh_4.2.4_3.0_1670021484393.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_512_zh_4.2.4_3.0_1670021484393.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_512","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_512","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
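A masked language model such as this one scores every vocabulary entry for a `[MASK]` position and picks the most probable fill via softmax. A toy sketch in plain Python (the vocabulary and logits are invented for illustration):

```python
import math

def softmax(scores):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary scores for a single [MASK] position.
vocab = ["北", "京", "海", "天"]
logits = [1.2, 4.8, 0.3, 2.1]
probs = softmax(logits)
best = vocab[probs.index(max(probs))]
print(best)  # 京
```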
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_10_h_512|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|161.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-10_H-512
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://github.com/dbiir/UER-py/
- https://cloud.tencent.com/
---
layout: model
title: Detect Assertion Status (assertion_jsl_augmented)
author: John Snow Labs
name: assertion_jsl_augmented
date: 2022-09-15
tags: [licensed, clinical, assertion, en]
task: Assertion Status
language: en
nav_key: models
edition: Healthcare NLP 4.1.0
spark_version: 3.0
supported: true
annotator: AssertionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The deep neural network architecture for assertion status detection in Spark NLP is based on a BiLSTM framework and is a modified version of the architecture proposed by Fancellu et al. (Fancellu, Lopez, and Webber 2016). Its goal is to classify the assertions made on given medical concepts as present, absent, or possible in the patient, conditionally present under certain circumstances, hypothetically present at some future point, or mentioned in the patient report but associated with someone else (Uzuner et al. 2011). This model is an augmented version of the [assertion_jsl](https://nlp.johnsnowlabs.com/2021/07/24/assertion_jsl_en.html) model, trained with additional in-house annotations, and it returns confidence scores with its results.
## Predicted Entities
`Present`, `Absent`, `Possible`, `Planned`, `Past`, `Family`, `Hypothetical`, `SomeoneElse`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_jsl_augmented_en_4.1.0_3.0_1663252918565.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_jsl_augmented_en_4.1.0_3.0_1663252918565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")\
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setBlackList(["RelativeDate", "Gender"])
clinical_assertion = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
clinical_assertion
])
text = """Patient had a headache for the last 2 weeks, and appears anxious when she walks fast. No alopecia noted. She denies pain. Her father is paralyzed and it is a stressor for her. She was bullied by her boss and got antidepressant. We prescribed sleeping pills for her current insomnia"""
data = spark.createDataFrame([[text]]).toDF('text')
result = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setBlackList(Array("RelativeDate", "Gender"))
val clinical_assertion = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings"))
.setOutputCol("assertion")
val nlpPipeline = Pipeline().setStages(Array(documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
clinical_assertion))
val data= Seq("Patient had a headache for the last 2 weeks, and appears anxious when she walks fast. No alopecia noted. She denies pain. Her father is paralyzed and it is a stressor for her. She was bullied by her boss and got antidepressant. We prescribed sleeping pills for her current insomnia").toDS.toDF("text")
val result = nlpPipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.assert.jsl_augmented").predict("""Patient had a headache for the last 2 weeks, and appears anxious when she walks fast. No alopecia noted. She denies pain. Her father is paralyzed and it is a stressor for her. She was bullied by her boss and got antidepressant. We prescribed sleeping pills for her current insomnia""")
```
## Results
```bash
+--------------+-----+---+-------------------------+-----------+---------+
|ner_chunk |begin|end|ner_label |sentence_id|assertion|
+--------------+-----+---+-------------------------+-----------+---------+
|headache |14 |21 |Symptom |0 |Past |
|anxious |57 |63 |Symptom |0 |Possible |
|alopecia |89 |96 |Disease_Syndrome_Disorder|1 |Absent |
|pain |116 |119|Symptom |2 |Absent |
|paralyzed |136 |144|Symptom |3 |Family |
|antidepressant|212 |225|Drug_Ingredient |4 |Past |
|sleeping pills|242 |255|Drug_Ingredient |5 |Planned |
|insomnia |273 |280|Symptom |5 |Present |
+--------------+-----+---+-------------------------+-----------+---------+
```
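In downstream use, these assertion labels typically drive filtering of the NER chunks, for example to keep only findings that actually apply to the patient. A sketch in plain Python over the (chunk, assertion) pairs shown in the table above:

```python
# Filter NER chunks by assertion status, keeping findings asserted for the
# patient (Present or Past) and dropping Absent/Family/Possible/Planned.
results = [
    ("headache", "Past"), ("anxious", "Possible"), ("alopecia", "Absent"),
    ("pain", "Absent"), ("paralyzed", "Family"), ("antidepressant", "Past"),
    ("sleeping pills", "Planned"), ("insomnia", "Present"),
]
patient_findings = [chunk for chunk, assertion in results
                    if assertion in {"Present", "Past"}]
print(patient_findings)  # ['headache', 'antidepressant', 'insomnia']
```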
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|assertion_jsl_augmented|
|Compatibility:|Healthcare NLP 4.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, chunk, embeddings]|
|Output Labels:|[assertion]|
|Language:|en|
|Size:|6.5 MB|
## Benchmarking
```bash
label precision recall f1-score
Absent 0.94 0.93 0.94
Family 0.88 0.91 0.89
Hypothetical 0.85 0.82 0.83
Past 0.89 0.89 0.89
Planned 0.78 0.81 0.80
Possible 0.82 0.82 0.82
Present 0.91 0.93 0.92
SomeoneElse 0.88 0.80 0.84
```
---
layout: model
title: Explain Document pipeline for Swedish (explain_document_lg)
author: John Snow Labs
name: explain_document_lg
date: 2021-03-23
tags: [open_source, swedish, explain_document_lg, pipeline, sv]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: sv
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The explain_document_lg is a pretrained pipeline that processes text with a simple sequence of basic processing steps and recognizes entities. It performs most of the common text processing tasks on your dataframe.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_sv_3.0.0_3.0_1616520973696.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_lg_sv_3.0.0_3.0_1616520973696.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('explain_document_lg', lang = 'sv')
annotations = pipeline.fullAnnotate(""Hej från John Snow Labs! "")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_lg", lang = "sv")
val result = pipeline.fullAnnotate("Hej från John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = [""Hej från John Snow Labs! ""]
result_df = nlu.load('sv.explain.lg').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | lemma | pos | embeddings | ner | entities |
|---:|:------------------------------|:-----------------------------|:-----------------------------------------|:-----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
| 0 | ['Hej från John Snow Labs! '] | ['Hej från John Snow Labs!'] | ['Hej', 'från', 'John', 'Snow', 'Labs!'] | ['Hej', 'från', 'John', 'Snow', 'Labs!'] | ['NOUN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0306969992816448,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```
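The `ner` column above holds BIO tags, which the pipeline's converter turns into the `entities` chunks. A minimal version of that conversion in plain Python, using the tokens and tags from the result row:

```python
def bio_to_chunks(tokens, tags):
    # Group tokens into entity chunks: B- starts a chunk, I- extends it,
    # anything else (O) closes the current chunk.
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ["Hej", "från", "John", "Snow", "Labs!"]
tags = ["O", "O", "B-PER", "I-PER", "I-PER"]
print(bio_to_chunks(tokens, tags))  # ['John Snow Labs!']
```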
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_document_lg|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|sv|
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Lolaibrin)
author: John Snow Labs
name: distilbert_qa_lolaibrin_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Lolaibrin`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_lolaibrin_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768653496.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_lolaibrin_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768653496.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lolaibrin_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lolaibrin_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_lolaibrin_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Lolaibrin/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Longformer Base NER Pipeline
author: ahmedlone127
name: longformer_base_token_classifier_conll03_pipeline
date: 2022-06-14
tags: [ner, longformer, pipeline, conll, token_classification, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: false
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [longformer_base_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/10/09/longformer_base_token_classifier_conll03_en.html) model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/longformer_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655213912525.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/longformer_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655213912525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("longformer_base_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I am working at John Snow Labs.")
```
```scala
val pipeline = new PretrainedPipeline("longformer_base_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I am working at John Snow Labs.")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|John |PER |
|John Snow Labs|ORG |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|longformer_base_token_classifier_conll03_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Community|
|Language:|en|
|Size:|516.0 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- LongformerForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from huxxx657)
author: John Snow Labs
name: roberta_qa_base_finetuned_scrambled_squad_15_new
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-15-new` is an English model originally trained by `huxxx657`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_15_new_en_4.3.0_3.0_1674216883682.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_15_new_en_4.3.0_3.0_1674216883682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_15_new","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_15_new","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_finetuned_scrambled_squad_15_new|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-15-new
---
layout: model
title: Fast Neural Machine Translation Model from English to Kwangali
author: John Snow Labs
name: opus_mt_en_kwn
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, kwn, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `kwn`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_kwn_xx_2.7.0_2.4_1609164344098.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_kwn_xx_2.7.0_2.4_1609164344098.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_en_kwn", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your text to translate here.")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_kwn", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.kwn').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_kwn|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Hindi Named Entity Recognition (from sagorsarker)
author: John Snow Labs
name: bert_ner_codeswitch_hineng_lid_lince
date: 2022-05-09
tags: [bert, ner, token_classification, hi, open_source]
task: Named Entity Recognition
language: hi
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `codeswitch-hineng-lid-lince` is a Hindi model originally trained by `sagorsarker`.
## Predicted Entities
`mixed`, `hin`, `other`, `unk`, `en`, `ambiguous`, `ne`, `fw`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_codeswitch_hineng_lid_lince_hi_3.4.2_3.0_1652097632881.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_codeswitch_hineng_lid_lince_hi_3.4.2_3.0_1652097632881.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_codeswitch_hineng_lid_lince","hi") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["मुझे स्पार्क एनएलपी बहुत पसंद है"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_codeswitch_hineng_lid_lince","hi")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("मुझे स्पार्क एनएलपी बहुत पसंद है").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_codeswitch_hineng_lid_lince|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|hi|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/sagorsarker/codeswitch-hineng-lid-lince
- https://ritual.uh.edu/lince/home
- https://github.com/sagorbrur/codeswitch
---
layout: model
title: Pipeline to Detect Clinical Entities (BertForTokenClassifier)
author: John Snow Labs
name: bert_token_classifier_ner_jsl_pipeline
date: 2023-03-20
tags: [ner_jsl, ner, berfortokenclassification, en, licensed]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [bert_token_classifier_ner_jsl](https://nlp.johnsnowlabs.com/2022/03/21/bert_token_classifier_ner_jsl_en_2_4.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_pipeline_en_4.3.0_3.2_1679305183990.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_pipeline_en_4.3.0_3.2_1679305183990.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_ner_jsl_pipeline", "en", "clinical/models")
text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_ner_jsl_pipeline", "en", "clinical/models")
val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.bert_token_ner_jsl.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_chinese_pert_large_open_domain_mrc","zh") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_chinese_pert_large_open_domain_mrc","zh")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.answer_question.bert.large").predict("""PUT YOUR QUESTION HERE|||PUT YOUR CONTEXT HERE""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_chinese_pert_large_open_domain_mrc|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|zh|
|Size:|1.2 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/qalover/chinese-pert-large-open-domain-mrc
- https://github.com/dbiir/UER-py/
---
layout: model
title: Sentence Entity Resolver for UMLS CUI Codes
author: John Snow Labs
name: sbiobertresolve_umls_major_concepts
date: 2021-10-03
tags: [entity_resolution, licensed, clinical, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.2.3
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities and concepts to 4 major categories of UMLS CUI codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It has faster load time, with a speedup of about 6X when compared to previous versions.
## Predicted Entities
This model returns CUI (concept unique identifier) codes for 4 major categories: `Clinical Findings`, `Medical Devices`, `Anatomical Structures`, and `Injuries & Poisoning` terms.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_major_concepts_en_3.2.3_3.0_1633221571574.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_major_concepts_en_3.2.3_3.0_1633221571574.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The ```sbiobertresolve_umls_major_concepts``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_jsl``` as the NER model. The labels ```Cerebrovascular_Disease, Communicable_Disease, Diabetes, Disease_Syndrome_Disorder, Heart_Disease, Hyperlipidemia, Hypertension, Injury_or_Poisoning, Kidney_Disease, Medical-Device, Obesity, Oncological, Overweight, Psychological_Condition, Symptom, VS_Finding, ImagingFindings, EKG_Findings``` should be set in ```.setWhiteList()```.
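The effect of `.setWhiteList()` is to keep only NER chunks whose entity label appears in the allowed set before they reach the resolver. A plain-Python sketch of that filtering behavior (illustrative only, using hypothetical chunk tuples rather than the Spark NLP API):

```python
# Illustrative only: mimics what NerConverter's setWhiteList does,
# keeping only NER chunks whose label appears in the whitelist.
WHITE_LIST = {
    "Cerebrovascular_Disease", "Communicable_Disease", "Diabetes",
    "Disease_Syndrome_Disorder", "Heart_Disease", "Hyperlipidemia",
    "Hypertension", "Injury_or_Poisoning", "Kidney_Disease",
    "Medical-Device", "Obesity", "Oncological", "Overweight",
    "Psychological_Condition", "Symptom", "VS_Finding",
    "ImagingFindings", "EKG_Findings",
}

def filter_chunks(chunks, white_list=WHITE_LIST):
    """Keep only (text, label) chunks whose label is whitelisted."""
    return [(text, label) for text, label in chunks if label in white_list]

chunks = [
    ("ankle pain", "Symptom"),               # kept
    ("falling from stairs", "Injury_or_Poisoning"),  # kept
    ("she", "Gender"),                       # dropped: not in whitelist
]
print(filter_chunks(chunks))
```

Only the whitelisted chunks are passed on as `ner_chunk`, so the resolver never scores entities outside the four major UMLS categories.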
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_umls_major_concepts","en", "clinical/models") \
.setInputCols(["ner_chunk_doc", "sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver])
data = spark.createDataFrame([["The patient complains of ankle pain after falling from stairs. She has been advised Arthroscopy by her primary care pyhsician"]]).toDF("text")
results = pipeline.fit(data).transform(data)
```
```scala
...
val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli", "en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
val resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_umls_major_concepts", "en", "clinical/models")
.setInputCols(Array("ner_chunk_doc", "sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val p_model = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver))
val data = Seq("The patient complains of ankle pain after falling from stairs. She has been advised Arthroscopy by her primary care pyhsician").toDF("text")
val res = p_model.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.umls").predict("""The patient complains of ankle pain after falling from stairs. She has been advised Arthroscopy by her primary care pyhsician""")
```
## Results
```bash
| | ner_chunk | code | code_description |
|---:|:------------------------------|:-------------|:---------------------------------------------|
| 0 | ankle | C4047548 | bilateral ankle joint pain (finding) |
| 1 | falling from stairs | C0417023 | fall from stairs |
| 2 | Arthroscopy | C0179144 | arthroscope |
| 3 | primary care pyhsician | C3266804 | referred by primary care physician (finding) |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_umls_major_concepts|
|Compatibility:|Healthcare NLP 3.2.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_chunk_embeddings]|
|Output Labels:|[umls_code]|
|Language:|en|
|Case sensitive:|false|
## Data Source
Trained on data sampled from https://www.nlm.nih.gov/research/umls/index.html
---
layout: model
title: BioBERT Sentence Embeddings (PMC)
author: John Snow Labs
name: sent_biobert_pmc_base_cased
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [embeddings, en, open_source]
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model contains a pre-trained weights of BioBERT, a language representation model for biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)".
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_biobert_pmc_base_cased_en_2.6.0_2.4_1598348966950.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_biobert_pmc_base_cased_en_2.6.0_2.4_1598348966950.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pmc_base_cased", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pmc_base_cased", "en")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.biobert.pmc_base_cased').predict(text, output_level='sentence')
embeddings_df
```
{:.h2_title}
## Results
```bash
sentence en_embed_sentence_biobert_pmc_base_cased_embeddings
I hate cancer [0.34035101532936096, 0.04413360357284546, -0....
Antibiotics aren't painkiller [0.4397204518318176, 0.066007100045681, -0.114...
```
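Sentence embeddings such as the 768-dimensional vectors above are typically compared with cosine similarity. A minimal sketch of that computation (pure Python, with short hypothetical vectors standing in for the real embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical 3-d stand-ins for the 768-d sentence embeddings above.
v_cancer = [0.34, 0.04, -0.12]
v_antibiotics = [0.44, 0.07, -0.11]
print(round(cosine(v_cancer, v_antibiotics), 3))
```

Higher values mean the two sentences are closer in the embedding space; downstream tasks like semantic search rank candidates by this score.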
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_biobert_pmc_base_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|[en]|
|Dimension:|768|
|Case sensitive:|true|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert)
---
layout: model
title: Explain Document Pipeline for Russian
author: John Snow Labs
name: explain_document_sm
date: 2021-03-22
tags: [open_source, russian, explain_document_sm, pipeline, ru]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: ru
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The explain_document_sm is a pretrained pipeline that performs the most common text processing steps (tokenization, lemmatization, part-of-speech tagging, and named entity recognition) on your dataframe.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_ru_3.0.0_3.0_1616422668270.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_sm_ru_3.0.0_3.0_1616422668270.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('explain_document_sm', lang = 'ru')
annotations = pipeline.fullAnnotate("Здравствуйте из Джона Снежных Лабораторий!")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_sm", lang = "ru")
val result = pipeline.fullAnnotate("Здравствуйте из Джона Снежных Лабораторий! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Здравствуйте из Джона Снежных Лабораторий!"]
result_df = nlu.load('ru.explain').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | lemma | pos | embeddings | ner | entities |
|---:|:------------------------------------------------|:-----------------------------------------------|:-----------------------------------------------------------|:-----------------------------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:-------------------------------|
| 0 | ['Здравствуйте из Джона Снежных Лабораторий! '] | ['Здравствуйте из Джона Снежных Лабораторий!'] | ['Здравствуйте', 'из', 'Джона', 'Снежных', 'Лабораторий!'] | ['здравствовать', 'из', 'Джон', 'Снежных', 'Лабораторий!'] | ['NOUN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['Джона Снежных Лабораторий!'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_document_sm|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|ru|
---
layout: model
title: Greek BertForMaskedLM Base Uncased model (from gealexandri)
author: John Snow Labs
name: bert_embeddings_greeksocial_base_greek_uncased_v1
date: 2022-12-06
tags: [el, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: el
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `greeksocialbert-base-greek-uncased-v1` is a Greek model originally trained by `gealexandri`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_greeksocial_base_greek_uncased_v1_el_4.2.4_3.0_1670326520370.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_greeksocial_base_greek_uncased_v1_el_4.2.4_3.0_1670326520370.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_greeksocial_base_greek_uncased_v1","el") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_greeksocial_base_greek_uncased_v1","el")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_greeksocial_base_greek_uncased_v1|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|el|
|Size:|424.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/gealexandri/greeksocialbert-base-greek-uncased-v1
- http://www.paloservices.com/
---
layout: model
title: Contextual SpellChecker Clinical
author: John Snow Labs
name: spellcheck_clinical
class: ContextSpellCheckerModel
language: en
nav_key: models
repository: clinical/models
date: 2020-04-17
task: Spell Check
edition: Healthcare NLP 2.4.2
spark_version: 2.4
tags: [clinical,licensed,en]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Implements the Noisy Channel Model spell-checking algorithm. Correction candidates are extracted by combining context information and word information.
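The noisy-channel idea scores each candidate correction by combining a channel model (how likely the observed typo is, given the candidate) with a word/language model (how likely the candidate itself is). A toy illustration of this scoring, with made-up frequencies and a flat edit penalty rather than the clinical model's actual components:

```python
import math

# Toy noisy-channel spell correction: score(w) = log P(w) + log P(typo | w).
# Vocabulary frequencies and the channel penalty are made up for illustration.
VOCAB = {"patient": 120, "patent": 5, "congestion": 40, "suction": 10}
TOTAL = sum(VOCAB.values())

def edits1(word):
    """All strings one edit (delete/transpose/replace/insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(typo):
    """Pick the in-vocabulary candidate with the highest combined score."""
    candidates = ({typo} | edits1(typo)) & VOCAB.keys()
    if not candidates:
        return typo  # nothing plausible: leave the word unchanged
    def score(w):
        channel = 0.0 if w == typo else math.log(0.1)  # flat edit penalty
        return math.log(VOCAB[w] / TOTAL) + channel
    return max(candidates, key=score)

print(correct("pateint"))  # both "patient" and "patent" are one edit away
```

The real model replaces the flat penalty with a learned channel model and the unigram counts with contextual probabilities, which is what lets it pick clinically sensible corrections.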
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/6.Clinical_Context_Spell_Checker.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_en_2.4.2_2.4_1587146727460.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_en_2.4.2_2.4_1587146727460.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
model = ContextSpellCheckerModel.pretrained("spellcheck_clinical", "en", "clinical/models") \
.setInputCols("token") \
.setOutputCol("spell")
```
```scala
val model = ContextSpellCheckerModel.pretrained("spellcheck_clinical","en","clinical/models")
.setInputCols("token")
.setOutputCol("spell")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.spell.clinical").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---------------|--------------------------|
| Name: | spellcheck_clinical |
| Type: | ContextSpellCheckerModel |
| Compatibility: | 2.4.2 |
| License: | Licensed |
| Edition: | Official |
|Input labels: | [token] |
|Output labels: | [spell] |
| Language: | en |
| Dependencies: | embeddings_clinical |
{:.h2_title}
## Data Source
Trained with PubMed and i2b2 datasets.
---
layout: model
title: English BertForQuestionAnswering Base Cased model (from rsvp-ai)
author: John Snow Labs
name: bert_qa_bertserini_base_cmrc
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bertserini-bert-base-cmrc` is an English model originally trained by `rsvp-ai`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bertserini_base_cmrc_en_4.0.0_3.0_1657188963909.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bertserini_base_cmrc_en_4.0.0_3.0_1657188963909.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bertserini_base_cmrc","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bertserini_base_cmrc","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bertserini_base_cmrc|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|381.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/rsvp-ai/bertserini-bert-base-cmrc
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from lingchensanwen)
author: John Snow Labs
name: distilbert_qa_lingchensanwen_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `lingchensanwen`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_lingchensanwen_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771975946.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_lingchensanwen_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771975946.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lingchensanwen_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lingchensanwen_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_lingchensanwen_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/lingchensanwen/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English BertForQuestionAnswering model (from juliusco)
author: John Snow Labs
name: bert_qa_biobert_base_cased_v1.1_squad_finetuned_biobert
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert-base-cased-v1.1-squad-finetuned-biobert` is an English model originally trained by `juliusco`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_base_cased_v1.1_squad_finetuned_biobert_en_4.0.0_3.0_1654185597741.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_base_cased_v1.1_squad_finetuned_biobert_en_4.0.0_3.0_1654185597741.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_base_cased_v1.1_squad_finetuned_biobert","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_biobert_base_cased_v1.1_squad_finetuned_biobert","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.biobert.base_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_biobert_base_cased_v1.1_squad_finetuned_biobert|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|403.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/juliusco/biobert-base-cased-v1.1-squad-finetuned-biobert
---
layout: model
title: Clinical Deidentification (Spanish)
author: John Snow Labs
name: clinical_deidentification
date: 2022-03-02
tags: [deid, es, licensed]
task: De-identification
language: es
edition: Healthcare NLP 3.4.1
spark_version: 2.4
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline is trained with sciwiki_300d embeddings and can be used to deidentify protected health information (PHI) in Spanish medical texts. PHI is masked or obfuscated in the resulting text. The pipeline can mask (with entity labels, same-length characters, or fixed-length characters) or obfuscate the following entities: `AGE`, `DATE`, `PROFESSION`, `E-MAIL`, `USERNAME`, `LOCATION`, `DOCTOR`, `HOSPITAL`, `PATIENT`, `URL`, `IP`, `MEDICALRECORD`, `IDNUM`, `ORGANIZATION`, `PHONE`, `ZIP`, `ACCOUNT`, `SSN`, `PLATE`, `SEX` and `IPADDR`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_es_3.4.1_2.4_1646246697330.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_es_3.4.1_2.4_1646246697330.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from johnsnowlabs import *
deid_pipeline = PretrainedPipeline("clinical_deidentification", "es", "clinical/models")
sample = """Datos del paciente.
Nombre: Jose .
Apellidos: Aranda Martinez.
NHC: 2748903.
NASS: 26 37482910 04.
Domicilio: Calle Losada Martí 23, 5 B..
Localidad/ Provincia: Madrid.
CP: 28016.
Datos asistenciales.
Fecha de nacimiento: 15/04/1977.
País: España.
Edad: 37 años Sexo: F.
Fecha de Ingreso: 05/06/2018.
Médico: María Merino Viveros NºCol: 28 28 35489.
Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra. María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com
"""
result = deid_pipeline.annotate(sample)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "es", "clinical/models")
val sample = """Datos del paciente.
Nombre: Jose .
Apellidos: Aranda Martinez.
NHC: 2748903.
NASS: 26 37482910 04.
Domicilio: Calle Losada Martí 23, 5 B..
Localidad/ Provincia: Madrid.
CP: 28016.
Datos asistenciales.
Fecha de nacimiento: 15/04/1977.
País: España.
Edad: 37 años Sexo: F.
Fecha de Ingreso: 05/06/2018.
Médico: María Merino Viveros NºCol: 28 28 35489.
Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra. María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com
"""
val result = deid_pipeline.annotate(sample)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.deid.clinical").predict("""Datos del paciente.
Nombre: Jose .
Apellidos: Aranda Martinez.
NHC: 2748903.
NASS: 26 37482910 04.
Domicilio: Calle Losada Martí 23, 5 B..
Localidad/ Provincia: Madrid.
CP: 28016.
Datos asistenciales.
Fecha de nacimiento: 15/04/1977.
País: España.
Edad: 37 años Sexo: F.
Fecha de Ingreso: 05/06/2018.
Médico: María Merino Viveros NºCol: 28 28 35489.
Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra. María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com
""")
```
## Results
```bash
Masked with entity labels
------------------------------
Datos del paciente.
Nombre: .
Apellidos: .
NHC: .
NASS: 04.
Domicilio: , 5 B..
Localidad/ Provincia: .
CP: .
Datos asistenciales.
Fecha de nacimiento: .
País: .
Edad: años Sexo: .
Fecha de Ingreso: .
: María Merino Viveros NºCol: .
Informe clínico del paciente: de años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias.
Antes de comenzar el cuadro estuvo en en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado.
Entre los comensales aparecieron varios casos de brucelosis.
Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación.
En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos.
Auscultación cardíaca rítmica, sin soplos, roces ni extratonos.
Auscultación pulmonar con conservación del murmullo vesicular.
Abdomen blando, depresible, sin masas ni megalias.
En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad.
Extremidades sin varices ni edemas.
Pulsos periféricos presentes y simétricos.
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3.
VSG: 40 mm 1ª hora.
Coagulación: TQ 87%;
TTPA 25,8 seg.
Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl.
Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: +++;
Test de Coombs > 1/1280; Brucellacapt > 1/5120.
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas).
El paciente significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra.
Servicio de Endocrinología y Nutrición km 12,500 28905 - () Correo electrónico:
Masked with chars
------------------------------
Datos del paciente.
Nombre: [**] .
Apellidos: [*************].
NHC: [*****].
NASS: ** [******] 04.
Domicilio: [*******************], 5 B..
Localidad/ Provincia: [****].
CP: [***].
Datos asistenciales.
Fecha de nacimiento: [********].
País: [****].
Edad: ** años Sexo: *.
Fecha de Ingreso: [********].
[****]: María Merino Viveros NºCol: ** ** [***].
Informe clínico del paciente: [***] de ** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias.
Antes de comenzar el cuadro estuvo en [*********] en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado.
Entre los comensales aparecieron varios casos de brucelosis.
Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación.
En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos.
Auscultación cardíaca rítmica, sin soplos, roces ni extratonos.
Auscultación pulmonar con conservación del murmullo vesicular.
Abdomen blando, depresible, sin masas ni megalias.
En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad.
Extremidades sin varices ni edemas.
Pulsos periféricos presentes y simétricos.
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3.
VSG: 40 mm 1ª hora.
Coagulación: TQ 87%;
TTPA 25,8 seg.
Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl.
Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: [*************] +++;
Test de Coombs > 1/1280; Brucellacapt > 1/5120.
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas).
El paciente [****] significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra.
[******************] [******************************] Servicio de Endocrinología y Nutrición [*****************] km 12,500 28905 [****] - [****] ([****]) Correo electrónico: [********************]
Masked with fixed length chars
------------------------------
Datos del paciente.
Nombre: **** .
Apellidos: ****.
NHC: ****.
NASS: **** **** 04.
Domicilio: ****, 5 B..
Localidad/ Provincia: ****.
CP: ****.
Datos asistenciales.
Fecha de nacimiento: ****.
País: ****.
Edad: **** años Sexo: ****.
Fecha de Ingreso: ****.
****: María Merino Viveros NºCol: **** **** ****.
Informe clínico del paciente: **** de **** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias.
Antes de comenzar el cuadro estuvo en **** en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado.
Entre los comensales aparecieron varios casos de brucelosis.
Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación.
En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos.
Auscultación cardíaca rítmica, sin soplos, roces ni extratonos.
Auscultación pulmonar con conservación del murmullo vesicular.
Abdomen blando, depresible, sin masas ni megalias.
En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad.
Extremidades sin varices ni edemas.
Pulsos periféricos presentes y simétricos.
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3.
VSG: 40 mm 1ª hora.
Coagulación: TQ 87%;
TTPA 25,8 seg.
Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl.
Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: **** +++;
Test de Coombs > 1/1280; Brucellacapt > 1/5120.
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas).
El paciente **** significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra.
**** **** Servicio de Endocrinología y Nutrición **** km 12,500 28905 **** - **** (****) Correo electrónico: ****
Obfuscated
------------------------------
Datos del paciente.
Nombre: Sr. Lerma .
Apellidos: Aristides Gonzalez Gelabert.
NHC: BBBBBBBBQR648597.
NASS: 041010000011 RZRM020101906017 04.
Domicilio: Valencia, 5 B..
Localidad/ Provincia: Madrid.
CP: 99335.
Datos asistenciales.
Fecha de nacimiento: 25/04/1977.
País: Barcelona.
Edad: 8 años Sexo: F..
Fecha de Ingreso: 02/08/2018.
transportista: María Merino Viveros NºCol: olegario10 olegario10 felisa78.
Informe clínico del paciente: RZRM020101906017 de 8 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias.
Antes de comenzar el cuadro estuvo en Madrid en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado.
Entre los comensales aparecieron varios casos de brucelosis.
Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación.
En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos.
Auscultación cardíaca rítmica, sin soplos, roces ni extratonos.
Auscultación pulmonar con conservación del murmullo vesicular.
Abdomen blando, depresible, sin masas ni megalias.
En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad.
Extremidades sin varices ni edemas.
Pulsos periféricos presentes y simétricos.
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3.
VSG: 40 mm 1ª hora.
Coagulación: TQ 87%;
TTPA 25,8 seg.
Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl.
Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Dra. Laguna +++;
Test de Coombs > 1/1280; Brucellacapt > 1/5120.
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas).
El paciente 041010000011 significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra.
Reinaldo Manjón Malo Barcelona Servicio de Endocrinología y Nutrición Valencia km 12,500 28905 Bilbao - Madrid (Barcelona) Correo electrónico: quintanasalome@example.net
```
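The output modes above follow simple substitution policies over the detected PHI spans. A minimal pure-Python sketch of the three masking modes (the entity spans are hand-written here; the actual pipeline detects them with NER and contextual parsers, and the obfuscation mode additionally substitutes realistic fake values):

```python
# Illustrative masking policies mirroring the pipeline's output modes.
# The (surface, label) pairs are hypothetical; the real pipeline finds them via NER.
def mask(text, entities, mode="entity_labels"):
    for surface, label in entities:
        if mode == "entity_labels":
            repl = f"<{label}>"                          # e.g. <PATIENT>
        elif mode == "same_length_chars":
            repl = "[" + "*" * max(len(surface) - 2, 0) + "]"  # same width as span
        elif mode == "fixed_length_chars":
            repl = "****"                                # fixed width
        text = text.replace(surface, repl)
    return text

sample = "Nombre: Jose. Localidad: Madrid."
ents = [("Jose", "PATIENT"), ("Madrid", "LOCATION")]
print(mask(sample, ents, "entity_labels"))      # Nombre: <PATIENT>. Localidad: <LOCATION>.
print(mask(sample, ents, "same_length_chars"))  # Nombre: [**]. Localidad: [****].
print(mask(sample, ents, "fixed_length_chars")) # Nombre: ****. Localidad: ****.
```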
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|clinical_deidentification|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|es|
|Size:|281.2 MB|
## Included Models
- nlp.DocumentAssembler
- nlp.SentenceDetectorDLModel
- nlp.TokenizerModel
- nlp.WordEmbeddingsModel
- medical.NerModel
- nlp.NerConverter
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ChunkMergeModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- Finisher
---
layout: model
title: Arabic Bert Embeddings (Base, Arabert Model, v01)
author: John Snow Labs
name: bert_embeddings_bert_base_arabertv01
date: 2022-04-11
tags: [bert, embeddings, ar, open_source]
task: Embeddings
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabertv01` is an Arabic model originally trained by `aubmindlab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabertv01_ar_3.4.2_3.0_1649677579686.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabertv01_ar_3.4.2_3.0_1649677579686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabertv01","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabertv01","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("أنا أحب شرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.embed.bert_base_arabertv01").predict("""أنا أحب شرارة NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_arabertv01|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|508.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/aubmindlab/bert-base-arabertv01
- https://github.com/google-research/bert
- https://arxiv.org/abs/2003.00104
- https://github.com/WissamAntoun/pydata_khobar_meetup
- http://alt.qcri.org/farasa/segmenter.html
- https://github.com/google-research/bert/blob/master/multilingual.md
- https://github.com/elnagara/HARD-Arabic-Dataset
- https://www.aclweb.org/anthology/D15-1299
- https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf
- https://github.com/mohamedadaly/LABR
- http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp
- https://github.com/husseinmozannar/SOQAL
- https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md
- https://arxiv.org/abs/2003.00104v2
- https://archive.org/details/arwiki-20190201
- https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4
- https://www.aclweb.org/anthology/W19-4619
- https://sites.aub.edu.lb/mindlab/
- https://www.yakshof.com/#/
- https://www.behance.net/rahalhabib
- https://www.linkedin.com/in/wissam-antoun-622142b4/
- https://twitter.com/wissam_antoun
- https://github.com/WissamAntoun
- https://www.linkedin.com/in/fadybaly/
- https://twitter.com/fadybaly
- https://github.com/fadybaly
---
layout: model
title: English RobertaForQuestionAnswering (from mbartolo)
author: John Snow Labs
name: roberta_qa_roberta_large_synqa_ext
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-synqa-ext` is an English model originally trained by `mbartolo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_synqa_ext_en_4.0.0_3.0_1655738082187.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_synqa_ext_en_4.0.0_3.0_1655738082187.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_synqa_ext","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_large_synqa_ext","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.synqa_ext.roberta.large.by_mbartolo").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_large_synqa_ext|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/mbartolo/roberta-large-synqa-ext
- https://arxiv.org/abs/2002.00293
- https://arxiv.org/abs/2104.08678
---
layout: model
title: Part of Speech for Latin
author: John Snow Labs
name: pos_ud_llct
date: 2021-03-09
tags: [part_of_speech, open_source, latin, pos_ud_llct, la]
task: Part of Speech Tagging
language: la
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`.
## Predicted Entities
- PUNCT
- ADP
- PROPN
- NOUN
- VERB
- DET
- CCONJ
- PRON
- ADJ
- NUM
- AUX
- SCONJ
- ADV
- PART
- X
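The averaged-perceptron idea behind this tagger can be sketched in a few lines: weights are bumped on every tagging mistake, and the final model uses the running average of all weight values, which is what stabilizes the tagger. The sketch below is an illustrative toy with invented features, tags, and data, not the Spark NLP implementation:

```python
from collections import defaultdict

class AveragedPerceptron:
    """Toy averaged perceptron for tagging: predict, update on error, average."""
    def __init__(self, tags):
        self.tags = tags
        self.w = defaultdict(float)        # (feature, tag) -> weight
        self.totals = defaultdict(float)   # accumulated weights for averaging
        self.t = 0                         # update counter

    def score(self, feats, tag):
        return sum(self.w[(f, tag)] for f in feats)

    def predict(self, feats):
        return max(self.tags, key=lambda tag: self.score(feats, tag))

    def update(self, feats, gold):
        self.t += 1
        guess = self.predict(feats)
        if guess != gold:
            for f in feats:
                self.w[(f, gold)] += 1.0
                self.w[(f, guess)] -= 1.0
        # accumulate every weight so we can average at the end
        for k, v in self.w.items():
            self.totals[k] += v

    def average(self):
        for k in self.w:
            self.w[k] = self.totals[k] / self.t

# toy training data: a suffix feature decides NOUN vs VERB
model = AveragedPerceptron(["NOUN", "VERB"])
data = [(["suffix=us"], "NOUN"), (["suffix=re"], "VERB")] * 5
for feats, gold in data:
    model.update(feats, gold)
model.average()
print(model.predict(["suffix=re"]))  # VERB
```

In practice the feature set includes the word itself, its prefixes/suffixes, and the previously predicted tags; the averaging step is the same.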
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_llct_la_3.0.0_3.0_1615292206384.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_llct_la_3.0.0_3.0_1615292206384.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ud_llct", "la") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['Aequaliter Nubila Labs Ioannes de salve ! ']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_llct", "la")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("Aequaliter Nubila Labs Ioannes de salve ! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["Aequaliter Nubila Labs Ioannes de salve ! "]
token_df = nlu.load('la.pos').predict(text)
token_df
```
## Results
```bash
token pos
0 Aequaliter PROPN
1 Nubila PROPN
2 Labs ADJ
3 Ioannes NOUN
4 de ADP
5 salve NOUN
6 ! PROPN
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_llct|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|la|
---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_10
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_10_en_4.0.0_3.0_1654189512725.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_10_en_4.0.0_3.0_1654189512725.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_10","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_10","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.span_bert.base_cased_1024d_seed_10").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_10|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|390.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-10
---
layout: model
title: Word2Vec Embeddings in Tatar (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, tt, open_source]
task: Embeddings
language: tt
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
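Conceptually, a lookup annotator is a table from token to a fixed 300-dimensional vector, with a zero vector for out-of-vocabulary tokens. A minimal sketch (the tiny table and its values are invented; the real model ships precomputed vectors for the Tatar vocabulary):

```python
DIM = 300
# hypothetical two-entry table; the real model covers the full vocabulary
table = {
    "мин": [0.1] * DIM,
    "яратам": [0.2] * DIM,
}

def embed(tokens):
    """Map each token to its 300-d vector; zeros when out of vocabulary."""
    zero = [0.0] * DIM
    # the model card lists the lookup as case-insensitive, so lowercase first
    return [table.get(tok.lower(), zero) for tok in tokens]

vectors = embed(["Мин", "яратам", "Spark"])
print([v[0] for v in vectors])  # [0.1, 0.2, 0.0]
```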
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_tt_3.4.1_3.0_1647462888271.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_tt_3.4.1_3.0_1647462888271.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","tt") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","tt")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("tt.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|tt|
|Size:|535.4 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: BERT Embeddings (Base Uncased)
author: John Snow Labs
name: bert_base_uncased
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model contains a deep bidirectional transformer trained on Wikipedia and the BookCorpus. The details are described in the paper "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)".
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_uncased_en_2.6.0_2.4_1598340514223.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_uncased_en_2.6.0_2.4_1598340514223.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("bert_base_uncased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("bert_base_uncased", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.bert').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_bert_embeddings
I [0.5920650362968445, 0.18827693164348602, 0.12...
love [1.2889715433120728, 0.8475795388221741, 0.720...
NLP [0.21503107249736786, -0.9925870299339294, 1.0...
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_base_uncased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|en|
|Dimension:|768|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from [https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1](https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1)
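The 768-dimensional vectors shown in the results above are typically compared with cosine similarity, e.g. to find semantically close tokens. A self-contained sketch on toy vectors (4-d stand-ins, not actual model outputs):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# toy 4-d stand-ins for 768-d BERT embeddings
love = [0.59, 0.19, 0.12, 0.80]
adore = [0.55, 0.22, 0.10, 0.78]
table = [0.90, -0.70, 0.40, -0.20]
print(cosine(love, adore) > cosine(love, table))  # True
```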
---
layout: model
title: Czech RobertaForMaskedLM Cased model (from fav-kky)
author: John Snow Labs
name: roberta_embeddings_fernet_news
date: 2022-12-12
tags: [cs, open_source, roberta_embeddings, robertaformaskedlm]
task: Embeddings
language: cs
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `FERNET-News` is a Czech model originally trained by `fav-kky`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_fernet_news_cs_4.2.4_3.0_1670858382244.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_fernet_news_cs_4.2.4_3.0_1670858382244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_fernet_news","cs") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_fernet_news","cs")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_fernet_news|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|cs|
|Size:|468.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/fav-kky/FERNET-News
- https://arxiv.org/abs/2107.10042
---
layout: model
title: English BertForQuestionAnswering model (from lewtun)
author: John Snow Labs
name: bert_qa_bert_base_uncased_finetuned_squad_v1
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-squad-v1` is an English model originally trained by `lewtun`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_squad_v1_en_4.0.0_3.0_1654181199655.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_squad_v1_en_4.0.0_3.0_1654181199655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_finetuned_squad_v1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_uncased_finetuned_squad_v1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base_uncased.by_lewtun").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_uncased_finetuned_squad_v1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/lewtun/bert-base-uncased-finetuned-squad-v1
---
layout: model
title: English Named Entity Recognition (from lucifermorninstar011)
author: John Snow Labs
name: distilbert_ner_autotrain_luicfer_company_861827409
date: 2022-05-16
tags: [distilbert, ner, token_classification, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-luicfer_company-861827409` is an English model originally trained by `lucifermorninstar011`.
## Predicted Entities
`vocab`, `company`
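The token-level tags this model emits (e.g. B-company/I-company) are usually grouped into entity chunks downstream; in Spark NLP that is the job of NerConverter. A minimal sketch of the grouping logic, assuming BIO-style tags (the example sentence and tags are invented):

```python
def bio_to_chunks(tokens, tags):
    """Group BIO token tags into (entity_type, text) chunks."""
    chunks, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((ctype, " ".join(current)))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(tok)
        else:  # "O" or an inconsistent tag closes any open chunk
            if current:
                chunks.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        chunks.append((ctype, " ".join(current)))
    return chunks

tokens = ["I", "work", "at", "John", "Snow", "Labs"]
tags = ["O", "O", "O", "B-company", "I-company", "I-company"]
print(bio_to_chunks(tokens, tags))  # [('company', 'John Snow Labs')]
```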
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_autotrain_luicfer_company_861827409_en_3.4.2_3.0_1652721660864.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_autotrain_luicfer_company_861827409_en_3.4.2_3.0_1652721660864.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_autotrain_luicfer_company_861827409","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_autotrain_luicfer_company_861827409","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_ner_autotrain_luicfer_company_861827409|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/lucifermorninstar011/autotrain-luicfer_company-861827409
---
layout: model
title: Summarize clinical notes (augmented)
author: John Snow Labs
name: summarizer_clinical_jsl_augmented
date: 2023-03-30
tags: [licensed, clinical, en, summarization, tensorflow]
task: Summarization
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalSummarizer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a modified version of a Flan-T5 (LLM) based summarization model that was first fine-tuned with natural instructions and then fine-tuned with clinical notes, encounters, critical care notes, discharge notes, and reports curated by John Snow Labs. It is further optimized through an augmented training methodology and dataset. It can generate summaries of up to 512 tokens from input texts of up to 1024 tokens.
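Because the input is capped at 1024 tokens, longer notes have to be summarized in windows and the partial summaries combined. A minimal sketch of whitespace-based windowing (the real limit applies to the model's subword tokenizer, so a whitespace count is only an approximation, and the overlap value is an arbitrary choice):

```python
def window_text(text, max_tokens=1024, overlap=64):
    """Split text into overlapping whitespace-token windows of at most max_tokens."""
    toks = text.split()
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(toks), step):
        windows.append(" ".join(toks[start:start + max_tokens]))
        if start + max_tokens >= len(toks):
            break
    return windows

note = ("word " * 2500).strip()
parts = window_text(note)
print(len(parts))  # 3
```

Each window can then be passed through the summarizer, and the concatenated partial summaries summarized once more if the result is still too long.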
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/MEDICAL_TEXT_SUMMARIZATION/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/32.Medical_Text_Summarization.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_augmented_en_4.3.2_3.0_1680203312371.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_augmented_en_4.3.2_3.0_1680203312371.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
summarizer = MedicalSummarizer()\
.pretrained("summarizer_clinical_jsl_augmented", "en", "clinical/models")\
.setInputCols("document")\
.setOutputCol("summary")\
.setMaxTextLength(512)\
.setMaxNewTokens(512)
pipeline = Pipeline(stages=[document, summarizer])
text = """Patient with hypertension, syncope, and spinal stenosis - for recheck.
(Medical Transcription Sample Report)
SUBJECTIVE:
The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema.
PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS:
Reviewed and unchanged from the dictation on 12/03/2003.
MEDICATIONS:
Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash."""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val summarizer = MedicalSummarizer()
.pretrained("summarizer_clinical_jsl_augmented", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("summary")
.setMaxTextLength(512)
.setMaxNewTokens(512)
val pipeline = new Pipeline().setStages(Array(document, summarizer))
val text = """Patient with hypertension, syncope, and spinal stenosis - for recheck.
(Medical Transcription Sample Report)
SUBJECTIVE:
The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema.
PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS:
Reviewed and unchanged from the dictation on 12/03/2003.
MEDICATIONS:
Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash."""
val data = Seq(text).toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
A 78-year-old female with hypertension, syncope, and spinal stenosis returns for a recheck. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. Her medications include Atenolol, Premarin, calcium with vitamin D, multivitamin, aspirin, and TriViFlor. She also has Elocon cream and Synalar cream for rash.
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|summarizer_clinical_jsl_augmented|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|920.0 MB|
## Benchmarking
### Benchmark on MtSamples Summarization Dataset:
| model_name | model_size | rouge | bleu | bertscore_precision | bertscore_recall | bertscore_f1 |
|--|--|--|--|--|--|--|
| philschmid/flan-t5-base-samsum | 250M | 0.1919 | 0.1124 | 0.8409 | 0.8964 | 0.8678 |
| linydub/bart-large-samsum | 500M | 0.1586 | 0.0732 | 0.8747 | 0.8184 | 0.8456 |
| philschmid/bart-large-cnn-samsum | 500M | 0.2170 | 0.1299 | 0.8846 | 0.8436 | 0.8636 |
| transformersbook/pegasus-samsum | 500M | 0.1924 | 0.0965 | 0.8920 | 0.8149 | 0.8517 |
| summarizer_clinical_jsl | 250M | 0.4836 | 0.4188 | 0.9041 | 0.9374 | 0.9204 |
| summarizer_clinical_jsl_augmented | 250M | 0.5119 | 0.4545 | 0.9282 | 0.9526 | 0.9402 |
### Benchmark on the MIMIC Summarization Dataset:
| model_name | model_size | rouge | bleu | bertscore_precision | bertscore_recall | bertscore_f1 |
|--|--|--|--|--|--|--|
| philschmid/flan-t5-base-samsum | 250M | 0.1910 | 0.1037 | 0.8708 | 0.9056 | 0.8879 |
| linydub/bart-large-samsum | 500M | 0.1252 | 0.0382 | 0.8933 | 0.8440 | 0.8679 |
| philschmid/bart-large-cnn-samsum | 500M | 0.1795 | 0.0889 | 0.9172 | 0.8978 | 0.9074 |
| transformersbook/pegasus-samsum | 570M | 0.1425 | 0.0582 | 0.9171 | 0.8682 | 0.8920 |
| summarizer_clinical_jsl | 250M | 0.395 | 0.2962 | 0.895 | 0.9316 | 0.913 |
| summarizer_clinical_jsl_augmented | 250M | 0.3964 | 0.307 | 0.9109 | 0.9452 | 0.9227 |
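The ROUGE scores above come from the standard evaluation toolkits. As an illustration only, a simplified unigram ROUGE-1 F1 can be sketched in plain Python; this is not the official ROUGE implementation (which also applies stemming and other normalization):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified unigram ROUGE-1 F1: clipped token overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the patient returns for recheck", "patient returns for a recheck"), 3))  # → 0.8
```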

## References
Trained on an in-house curated dataset.
---
layout: model
title: Part of Speech for Norwegian Nynorsk
author: John Snow Labs
name: pos_ud_nynorsk
date: 2020-05-05 18:57:00 +0800
task: Part of Speech Tagging
language: nn
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [pos, nn]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_nynorsk_nn_2.5.0_2.4_1588693690964.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_nynorsk_nn_2.5.0_2.4_1588693690964.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
pos = PerceptronModel.pretrained("pos_ud_nynorsk", "nn") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene.")
```
```scala
...
val pos = PerceptronModel.pretrained("pos_ud_nynorsk", "nn")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene."""]
pos_df = nlu.load('nn.pos.ud_nynorsk').predict(text)
pos_df
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='pos', begin=0, end=4, result='NOUN', metadata={'word': 'Annet'}),
Row(annotatorType='pos', begin=6, end=8, result='SCONJ', metadata={'word': 'enn'}),
Row(annotatorType='pos', begin=10, end=10, result='PART', metadata={'word': 'å'}),
Row(annotatorType='pos', begin=12, end=15, result='VERB', metadata={'word': 'være'}),
Row(annotatorType='pos', begin=17, end=22, result='NOUN', metadata={'word': 'kongen'}),
...]
```
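Each row above pairs a POS tag (`result`) with the original word stored in `metadata`. A small plain-Python helper, assuming `fullAnnotate`-style dictionaries as sketched here, can flatten them into readable `(word, tag)` pairs:

```python
def to_word_tag_pairs(annotations):
    """Flatten POS annotations (dicts with 'result' and metadata 'word') into (word, tag) pairs."""
    return [(a["metadata"]["word"], a["result"]) for a in annotations]

# Mocked annotation rows mirroring the output shape shown above
rows = [
    {"result": "NOUN", "metadata": {"word": "Annet"}},
    {"result": "SCONJ", "metadata": {"word": "enn"}},
    {"result": "PART", "metadata": {"word": "å"}},
]
print(to_word_tag_pairs(rows))  # [('Annet', 'NOUN'), ('enn', 'SCONJ'), ('å', 'PART')]
```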
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_nynorsk|
|Type:|pos|
|Compatibility:|Spark NLP 2.5.0+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[pos]|
|Language:|nn|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: Chinese Bert Embeddings (Large, MacBERT)
author: John Snow Labs
name: bert_embeddings_chinese_macbert_large
date: 2022-04-11
tags: [bert, embeddings, zh, open_source]
task: Embeddings
language: zh
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chinese-macbert-large` is a Chinese model originally trained by `hfl`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_macbert_large_zh_3.4.2_3.0_1649669165054.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_macbert_large_zh_3.4.2_3.0_1649669165054.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_macbert_large","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_macbert_large","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.embed.chinese_macbert_large").predict("""I love Spark NLP""")
```
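Token embeddings such as these are typically compared with cosine similarity. A minimal pure-Python sketch (the vectors here are illustrative, not actual model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.2, 0.1, -0.4]
v2 = [0.2, 0.1, -0.4]
print(round(cosine_similarity(v1, v2), 3))  # identical vectors → 1.0
```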
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_macbert_large|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|1.2 GB|
|Case sensitive:|true|
## References
- https://huggingface.co/hfl/chinese-macbert-large
- https://github.com/ymcui/MacBERT/blob/master/LICENSE
- https://2020.emnlp.org
- https://arxiv.org/abs/2004.13922
- https://github.com/ymcui/Chinese-BERT-wwm
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/ymcui/HFL-Anthology
- https://github.com/chatopera/Synonyms
---
layout: model
title: Mapping Drug Brand Names with Corresponding National Drug Codes
author: John Snow Labs
name: drug_brandname_ndc_mapper
date: 2022-05-11
tags: [chunk_mapper, en, licensed, ndc, clinical]
task: Chunk Mapping
language: en
nav_key: models
edition: Healthcare NLP 3.5.1
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained model maps drug brand names to their corresponding National Drug Codes (NDC). Product NDCs for each strength are returned in the result and metadata.
## Predicted Entities
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/drug_brandname_ndc_mapper_en_3.5.1_3.0_1652259542096.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/drug_brandname_ndc_mapper_en_3.5.1_3.0_1652259542096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("chunk")
chunkerMapper = ChunkMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models")\
.setInputCols(["chunk"])\
.setOutputCol("ndc")\
.setRel("Strength_NDC")
pipeline = Pipeline().setStages([document_assembler,
chunkerMapper])
model = pipeline.fit(spark.createDataFrame([['']]).toDF('text'))
lp = LightPipeline(model)
result = lp.fullAnnotate(["zytiga", "zyvana", "ZYVOX", "ZYTIGA"])
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("chunk")
val chunkerMapper = ChunkMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models")
.setInputCols("chunk")
.setOutputCol("ndc")
.setRel("Strength_NDC")
val pipeline = new Pipeline().setStages(Array(document_assembler,
chunkerMapper))
val text_data = Seq("zytiga", "zyvana", "ZYVOX", "ZYTIGA").toDS.toDF("text")
val res = pipeline.fit(text_data).transform(text_data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.drug_brand_to_ndc").predict("""Put your text here.""")
```
## Results
```bash
|    | Brandname   | Strength_NDC             | Other_NDCs                                                 |
|---:|:------------|:-------------------------|:----------------------------------------------------------|
|  0 | zytiga      | 500 mg/1 | 57894-195     | ['250 mg/1 | 57894-150']                                   |
|  1 | zyvana      | 527 mg/1 | 69336-405     | ['']                                                       |
|  2 | ZYVOX       | 600 mg/300mL | 0009-4992 | ['600 mg/300mL | 66298-7807', '600 mg/300mL | 0009-7807'] |
|  3 | ZYTIGA      | 500 mg/1 | 57894-195     | ['250 mg/1 | 57894-150']                                   |
```
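Each mapping above packs the strength and the NDC into a single pipe-delimited string. A small plain-Python helper, assuming that `"strength | ndc"` format, can split them for downstream use:

```python
def parse_strength_ndc(mapping: str):
    """Split a 'strength | ndc' mapping string into a (strength, ndc) tuple."""
    strength, _, ndc = mapping.partition("|")
    return strength.strip(), ndc.strip()

print(parse_strength_ndc("500 mg/1 | 57894-195"))  # ('500 mg/1', '57894-195')
```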
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|drug_brandname_ndc_mapper|
|Compatibility:|Healthcare NLP 3.5.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|3.0 MB|
---
layout: model
title: English image_classifier_vit_rock_challenge_DeiT_solo ViTForImageClassification from dimbyTa
author: John Snow Labs
name: image_classifier_vit_rock_challenge_DeiT_solo
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rock_challenge_DeiT_solo` is an English model originally trained by dimbyTa.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rock_challenge_DeiT_solo_en_4.1.0_3.0_1660170757484.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rock_challenge_DeiT_solo_en_4.1.0_3.0_1660170757484.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_rock_challenge_DeiT_solo", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_rock_challenge_DeiT_solo", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_rock_challenge_DeiT_solo|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|81.7 MB|
---
layout: model
title: Pipeline to Detect Bacterial Species (BertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_ner_bacteria_pipeline
date: 2023-03-20
tags: [bacteria, bertfortokenclassification, ner, en, licensed]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [bert_token_classifier_ner_bacteria](https://nlp.johnsnowlabs.com/2022/01/07/bert_token_classifier_ner_bacteria_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bacteria_pipeline_en_4.3.0_3.2_1679305685030.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bacteria_pipeline_en_4.3.0_3.2_1679305685030.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_ner_bacteria_pipeline", "en", "clinical/models")
text = '''Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_ner_bacteria_pipeline", "en", "clinical/models")
val text = "Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T))."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.bacteria_ner.pipeline").predict("""Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:------------------------|--------:|------:|:------------|-------------:|
| 0 | SMSP (T) | 73 | 80 | SPECIES | 0.99985 |
| 1 | Methanoregula formicica | 167 | 189 | SPECIES | 0.999787 |
| 2 | SMSP (T) | 222 | 229 | SPECIES | 0.999871 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_bacteria_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.9 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverterInternalModel
---
layout: model
title: Turkish Bert Embeddings
author: John Snow Labs
name: bert_embeddings_bert_base_tr_cased
date: 2022-04-11
tags: [bert, embeddings, tr, open_source]
task: Embeddings
language: tr
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-tr-cased` is a Turkish model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_tr_cased_tr_3.4.2_3.0_1649675409597.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_tr_cased_tr_3.4.2_3.0_1649675409597.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_tr_cased","tr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Spark NLP'yi seviyorum"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_tr_cased","tr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Spark NLP'yi seviyorum").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("tr.embed.bert_cased").predict("""Spark NLP'yi seviyorum""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_tr_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|tr|
|Size:|378.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/bert-base-tr-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Zero-Shot Named Entity Recognition (RoBERTa)
author: John Snow Labs
name: zero_shot_ner_roberta
date: 2022-08-29
tags: [ner, zero_shot, licensed, clinical, en, roberta]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.0.2
spark_version: 3.0
supported: true
recommended: true
annotator: ZeroShotNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model was trained with a Zero-Shot Named Entity Recognition (NER) approach: it can detect any user-defined entities without a training dataset, using only the pretrained RoBERTa embeddings included in the model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/zero_shot_ner_roberta_en_4.0.2_3.0_1661769801401.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/zero_shot_ner_roberta_en_4.0.2_3.0_1661769801401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
zero_shot_ner = ZeroShotNerModel.pretrained("zero_shot_ner_roberta", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("zero_shot_ner")\
.setEntityDefinitions(
{
"NAME": ["What is his name?", "What is my name?", "What is her name?"],
"CITY": ["Which city?", "Which is the city?"]
})
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "zero_shot_ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages = [
documentAssembler,
sentenceDetector,
tokenizer,
zero_shot_ner,
ner_converter])
zero_shot_ner_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
data = spark.createDataFrame([["Hellen works in London, Paris and Berlin. My name is Clara, I live in New York and Hellen lives in Paris."],
["John is a man who works in London, London and London."]]).toDF("text")
result = zero_shot_ner_model.transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val zero_shot_ner = ZeroShotNerModel.pretrained("zero_shot_ner_roberta", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("zero_shot_ner")
.setEntityDefinitions(Map(
"NAME"-> Array("What is his name?", "What is my name?", "What is her name?"),
"CITY"-> Array("Which city?", "Which is the city?")
))
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "zero_shot_ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
zero_shot_ner,
ner_converter))
val data = Seq("Hellen works in London, Paris and Berlin. My name is Clara, I live in New York and Hellen lives in Paris.",
"John is a man who works in London, London and London.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.zero_shot.ner_roberta").predict("""Hellen works in London, Paris and Berlin. My name is Clara, I live in New York and Hellen lives in Paris.""")
```
## Results
```bash
+------+---------+--------+-----+---+----------+
| token|ner_label|sentence|begin|end|confidence|
+------+---------+--------+-----+---+----------+
|Hellen| B-NAME| 0| 0| 5|0.13306311|
| works| O| 0| 7| 11| null|
| in| O| 0| 13| 14| null|
|London| B-CITY| 0| 16| 21| 0.4064213|
| ,| O| 0| 22| 22| null|
| Paris| B-CITY| 0| 24| 28|0.04597357|
| and| O| 0| 30| 32| null|
|Berlin| B-CITY| 0| 34| 39|0.16265489|
| .| O| 0| 40| 40| null|
| My| O| 1| 42| 43| null|
| name| O| 1| 45| 48| null|
| is| O| 1| 50| 51| null|
| Clara| B-NAME| 1| 53| 57| 0.9274031|
| ,| O| 1| 58| 58| null|
| I| O| 1| 60| 60| null|
| live| O| 1| 62| 65| null|
| in| O| 1| 67| 68| null|
| New| B-CITY| 1| 70| 72|0.82799006|
| York| I-CITY| 1| 74| 77|0.82799006|
| and| O| 1| 79| 81| null|
|Hellen| B-NAME| 1| 83| 88|0.40429682|
| lives| O| 1| 90| 94| null|
| in| O| 1| 96| 97| null|
| Paris| B-CITY| 1| 99|103|0.49216735|
| .| O| 1| 104|104| null|
| John| B-NAME| 0| 0| 3|0.14063153|
| is| O| 0| 5| 6| null|
| a| O| 0| 8| 8| null|
| man| O| 0| 10| 12| null|
| who| O| 0| 14| 16| null|
| works| O| 0| 18| 22| null|
| in| O| 0| 24| 25| null|
|London| B-CITY| 0| 27| 32|0.15521188|
| ,| O| 0| 33| 33| null|
|London| B-CITY| 0| 35| 40|0.12151082|
| and| O| 0| 42| 44| null|
|London| B-CITY| 0| 46| 51| 0.2650951|
| .| O| 0| 52| 52| null|
+------+---------+--------+-----+---+----------+
```
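The token-level BIO tags above are merged into entity chunks by the `NerConverter` stage. A minimal pure-Python sketch of that merging step (illustrative only, not the NerConverter implementation itself):

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (entity_text, label) chunks."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close any open chunk before starting a new one
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)  # continue the open chunk
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:  # flush a chunk that runs to the end of the sentence
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["I", "live", "in", "New", "York"]
tags = ["O", "O", "O", "B-CITY", "I-CITY"]
print(bio_to_chunks(tokens, tags))  # [('New York', 'CITY')]
```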
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|zero_shot_ner_roberta|
|Compatibility:|Healthcare NLP 4.0.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|460.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
As it is a Zero-Shot NER, no training dataset is necessary.
---
layout: model
title: Fast Neural Machine Translation Model from Arabic to Italian
author: John Snow Labs
name: opus_mt_ar_it
date: 2021-06-01
tags: [open_source, seq2seq, translation, ar, it, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
source languages: ar
target languages: it
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ar_it_xx_3.1.0_2.4_1622556125806.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ar_it_xx_3.1.0_2.4_1622556125806.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_ar_it", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
data = spark.createDataFrame([["text to translate"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_ar_it", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Arabic.translate_to.Italian').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_ar_it|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForQuestionAnswering Large Cased model (from deepset)
author: John Snow Labs
name: roberta_qa_deepset_large_squad2
date: 2022-12-02
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-squad2` is an English model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_large_squad2_en_4.2.4_3.0_1669987928587.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_large_squad2_en_4.2.4_3.0_1669987928587.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_large_squad2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_large_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_deepset_large_squad2|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/deepset/roberta-large-squad2
---
layout: model
title: Legal Question Answering (RoBerta, CUAD, Base)
author: John Snow Labs
name: legqa_roberta_cuad_base
date: 2023-01-30
tags: [en, licensed, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Legal RoBERTa-based Question Answering model, trained on SQuAD 2.0 and fine-tuned on the CUAD dataset (base). A specific prompt format is required to use it; here is an example for extracting PARTIES:
```
"Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract"
```
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legqa_roberta_cuad_base_en_1.0.0_3.0_1675083334950.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legqa_roberta_cuad_base_en_1.0.0_3.0_1675083334950.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
spanClassifier = nlp.RoBertaForQuestionAnswering.pretrained("legqa_roberta_cuad_base","en", "legal/models") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = nlp.Pipeline().setStages([
documentAssembler,
spanClassifier
])
text = """THIS CREDIT AGREEMENT is dated as of April 29, 2010, and is made by and
among P.H. GLATFELTER COMPANY, a Pennsylvania corporation ( the "COMPANY") and
certain of its subsidiaries. Identified on the signature pages hereto (each a
"BORROWER" and collectively, the "BORROWERS"), each of the GUARANTORS (as
hereinafter defined), the LENDERS (as hereinafter defined), PNC BANK, NATIONAL
ASSOCIATION, in its capacity as agent for the Lenders under this Agreement
(hereinafter referred to in such capacity as the "ADMINISTRATIVE AGENT"), and,
for the limited purpose of public identification in trade tables, PNC CAPITAL
MARKETS LLC and CITIZENS BANK OF PENNSYLVANIA, as joint arrangers and joint
bookrunners, and CITIZENS BANK OF PENNSYLVANIA, as syndication agent.""".replace('\n',' ')
questions = ['"Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract"']
qt = [ [q,text] for q in questions ]
example = spark.createDataFrame(qt).toDF("question", "context")
result = pipeline.fit(example).transform(example)
result.select('document_question.result', 'answer.result').show(truncate=False)
```
## Results
```bash
["Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract"]|[P . H . GLATFELTER COMPANY]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legqa_roberta_cuad_base|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|453.7 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
Squad, finetuned with CUAD-based Question/Answering
---
layout: model
title: English Named Entity Recognition (from DeDeckerThomas)
author: John Snow Labs
name: distilbert_ner_keyphrase_extraction_distilbert_kptimes
date: 2022-05-16
tags: [distilbert, ner, token_classification, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-kptimes` is an English model originally trained by `DeDeckerThomas`.
## Predicted Entities
`KEY`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_keyphrase_extraction_distilbert_kptimes_en_3.4.2_3.0_1652721921747.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_keyphrase_extraction_distilbert_kptimes_en_3.4.2_3.0_1652721921747.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_keyphrase_extraction_distilbert_kptimes","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_keyphrase_extraction_distilbert_kptimes","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_ner_keyphrase_extraction_distilbert_kptimes|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.7 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/DeDeckerThomas/keyphrase-extraction-distilbert-kptimes
- https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=kptimes
---
layout: model
title: Multilingual T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_flan_small
date: 2023-01-30
tags: [vi, ne, fi, ur, ku, yo, si, ru, it, zh, la, hi, he, xh, so, ca, ar, as, sw, en, ro, ig, te, th, ta, ce, es, gu, or, fr, ka, "no", li, cr, ch, be, ha, ga, ja, pa, ko, sl, open_source, t5, xx, tensorflow]
task: Text Generation
language: xx
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `flan-t5-small` is a Multilingual model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_flan_small_xx_4.3.0_3.0_1675102370004.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_flan_small_xx_4.3.0_3.0_1675102370004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_flan_small","xx") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_flan_small","xx")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
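FLAN-T5 models follow natural-language instructions rather than a fixed task prefix, so the `"PUT YOUR STRING HERE"` placeholder in the examples above should be replaced with an instruction prompt. A few illustrative prompts (assumed examples, not from the original card):

```python
# Illustrative FLAN-style instruction prompts; any of these strings
# can replace the placeholder text in the pipeline examples above.
prompts = [
    "Translate English to German: How old are you?",
    "Answer the following question: What is the capital of France?",
    "Summarize: Spark NLP is a library built on top of Apache Spark.",
]
for p in prompts:
    print(p)
```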
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_flan_small|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|xx|
|Size:|349.5 MB|
## References
- https://huggingface.co/google/flan-t5-small
- https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints
- https://arxiv.org/pdf/2210.11416.pdf
- https://github.com/google-research/t5x
- https://github.com/google/jax
- https://mlco2.github.io/impact#compute
- https://arxiv.org/abs/1910.09700
---
layout: model
title: Multilingual XLMRoBerta Embeddings
author: John Snow Labs
name: xlmroberta_embeddings_afriberta_base
date: 2022-05-13
tags: [ha, yo, ig, am, so, open_source, xlm_roberta, embeddings, xx, afriberta]
task: Embeddings
language: xx
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
recommended: true
annotator: XlmRoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `afriberta_base` is a Multilingual model originally trained by `castorini`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_afriberta_base_xx_3.4.4_3.0_1652439193066.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_afriberta_base_xx_3.4.4_3.0_1652439193066.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_afriberta_base","xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_afriberta_base","xx")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_embeddings_afriberta_base|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|xx|
|Size:|417.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/castorini/afriberta_base
- https://github.com/keleog/afriberta
---
layout: model
title: English asr_wav2vec2_med_custom_train_large TFWav2Vec2ForCTC from PrajwalS
author: John Snow Labs
name: asr_wav2vec2_med_custom_train_large
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_med_custom_train_large` is an English model originally trained by PrajwalS.
NOTE: This model only works on a CPU. If you need to use it on a GPU device, please use asr_wav2vec2_med_custom_train_large_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_med_custom_train_large_en_4.2.0_3.0_1664122216388.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_med_custom_train_large_en_4.2.0_3.0_1664122216388.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_med_custom_train_large", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_med_custom_train_large", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_med_custom_train_large|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Malay T5ForConditionalGeneration Tiny Cased model (from mesolitica)
author: John Snow Labs
name: t5_tiny_bahasa_cased
date: 2023-01-31
tags: [ms, open_source, t5, tensorflow]
task: Text Generation
language: ms
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-tiny-bahasa-cased` is a Malay model originally trained by `mesolitica`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_tiny_bahasa_cased_ms_4.3.0_3.0_1675156097275.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_tiny_bahasa_cased_ms_4.3.0_3.0_1675156097275.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_tiny_bahasa_cased","ms") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_tiny_bahasa_cased","ms")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_tiny_bahasa_cased|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|ms|
|Size:|90.5 MB|
## References
- https://huggingface.co/mesolitica/t5-tiny-bahasa-cased
- https://github.com/huseinzol05/malaya/tree/master/pretrained-model/t5/prepare
- https://github.com/google-research/text-to-text-transfer-transformer
- https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5
---
layout: model
title: Legal Relation Extraction (Parties, Alias, Dates, Document Type, Sm, Bidirectional)
author: John Snow Labs
name: legre_contract_doc_parties
date: 2022-08-12
tags: [en, legal, re, relations, agreements, licensed]
task: Relation Extraction
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
recommended: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
IMPORTANT: Don't run this model on a whole legal agreement. Instead:
- Split the document into paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration.
- Use the `legclf_introduction_clause` Text Classifier to select only those paragraphs.
This is a Legal Relation Extraction model, which can be used after the NER Model for extracting Parties, Document Types, Effective Dates and Aliases, called `legner_contract_doc_parties`.
As an output, you will get the relations linking the different concepts together, if such relation exists. The list of relations is:
- dated_as: A Document has an Effective Date
- has_alias: The alias used for a Party throughout the document
- has_collective_alias: An alias held by several parties at the same time
- signed_by: Between a Party and the document they signed
This is a `sm` model without meaningful directions in the relations (the model was not trained to understand whether a relation goes from left to right or from right to left). Bigger models trained with directed relationships are also available in Models Hub.
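After running the pipeline, the rows in the `relations` column can be flattened into (entity, relation, entity) triples. A minimal sketch, assuming the usual Spark NLP relation-extraction output schema with `result` holding the relation label and `metadata` holding the `chunk1`, `chunk2`, and `confidence` keys:

```python
# Hypothetical post-processing: turn relation annotations into triples,
# keeping only predictions above a confidence threshold. The metadata
# keys assumed here follow the usual Spark NLP RE output schema.
def to_triples(relations, threshold=0.5):
    triples = []
    for rel in relations:
        conf = float(rel["metadata"]["confidence"])
        if rel["result"] != "no_rel" and conf >= threshold:
            triples.append((rel["metadata"]["chunk1"],
                            rel["result"],
                            rel["metadata"]["chunk2"]))
    return triples

# Toy row shaped like the assumed schema:
rows = [{"result": "has_alias",
         "metadata": {"chunk1": "Armstrong Flooring, Inc",
                      "chunk2": "Seller",
                      "confidence": "0.93"}}]
print(to_triples(rows))  # [('Armstrong Flooring, Inc', 'has_alias', 'Seller')]
```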
## Predicted Entities
`dated_as`, `has_alias`, `has_collective_alias`, `signed_by`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/LEGALRE_PARTIES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legre_contract_doc_parties_en_1.0.0_3.2_1660293010932.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legre_contract_doc_parties_en_1.0.0_3.2_1660293010932.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_base_uncased_legal", "en") \
.setInputCols("document", "token") \
.setOutputCol("embeddings")
ner_model = legal.NerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')\
.setInputCols(["document", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["document","token","ner"])\
.setOutputCol("ner_chunk")
reDL = legal.RelationExtractionDLModel().pretrained('legre_contract_doc_parties', 'en', 'legal/models')\
.setPredictionThreshold(0.5)\
.setInputCols(["ner_chunk", "document"])\
.setOutputCol("relations")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
tokenizer,
embeddings,
ner_model,
ner_converter,
reDL
])
text='''
This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties").
'''
data = spark.createDataFrame([[text]]).toDF("text")
model = nlpPipeline.fit(data)
result = model.transform(data)
```
## Results
```bash
relation entity1 entity1_begin entity1_end chunk1 entity2 entity2_begin entity2_end chunk2 confidence
dated_as DOC 6 36 INTELLECTUAL PROPERTY AGREEMENT EFFDATE 70 86 December 31, 2018 0.9933402
signed_by DOC 6 36 INTELLECTUAL PROPERTY AGREEMENT PARTY 142 164 Armstrong Flooring, Inc 0.6235637
signed_by DOC 6 36 INTELLECTUAL PROPERTY AGREEMENT PARTY 316 331 AHF Holding, Inc 0.5001139
has_alias PARTY 142 164 Armstrong Flooring, Inc ALIAS 193 198 Seller 0.93385726
has_alias PARTY 206 222 AFI Licensing LLC ALIAS 264 272 Licensing 0.9859913
has_collective_alias ALIAS 293 298 Seller ALIAS 302 308 Arizona 0.82137156
has_alias PARTY 316 331 AHF Holding, Inc ALIAS 400 404 Buyer 0.8178999
has_alias PARTY 412 446 Armstrong Hardwood Flooring Company ALIAS 479 485 Company 0.9557921
has_alias PARTY 412 446 Armstrong Hardwood Flooring Company ALIAS 575 579 Buyer 0.6778585
has_alias PARTY 412 446 Armstrong Hardwood Flooring Company ALIAS 612 616 Party 0.6778583
has_alias PARTY 412 446 Armstrong Hardwood Flooring Company ALIAS 642 648 Parties 0.6778585
has_collective_alias ALIAS 506 510 Buyer ALIAS 517 530 Buyer Entities 0.69863707
has_collective_alias ALIAS 517 530 Buyer Entities ALIAS 575 579 Buyer 0.55453944
has_collective_alias ALIAS 517 530 Buyer Entities ALIAS 612 616 Party 0.55453944
has_collective_alias ALIAS 517 530 Buyer Entities ALIAS 642 648 Parties 0.55453944
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legre_contract_doc_parties|
|Type:|legal|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|409.9 MB|
## References
Manual annotations on CUAD dataset
## Benchmarking
```bash
label Recall Precision F1 Support
dated_as 0.962 0.962 0.962 26
has_alias 0.936 0.946 0.941 94
has_collective_alias 1.000 1.000 1.000 7
no_rel 0.982 0.980 0.981 497
signed_by 0.961 0.961 0.961 76
Avg. 0.968 0.970 0.969 -
Weighted-Avg. 0.973 0.973 0.973 -
```
---
layout: model
title: Hocr for table recognition pdf
author: John Snow Labs
name: hocr_table_recognition_pdf
date: 2023-01-23
tags: [en, licensed]
task: HOCR Table Recognition
language: en
nav_key: models
edition: Visual NLP 4.2.4
spark_version: 3.2.1
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Table structure recognition for PDF documents, based on HOCR output produced with the Tesseract architecture.
Tesseract has been trained on a variety of datasets to improve its recognition capabilities. These datasets include images of text in various languages and scripts, as well as images with different font styles, sizes, and orientations. The training process involves feeding the engine with a large number of images and their corresponding text, allowing the engine to learn the patterns and characteristics of different text styles. One of the most important datasets used in training Tesseract is the UNLV dataset, which contains over 400,000 images of text in different languages, scripts, and font styles. This dataset is widely used in the OCR community and has been instrumental in improving the accuracy of Tesseract. Other datasets that have been used in training Tesseract include the ICDAR dataset, the IIIT-HWS dataset, and the RRC-GV-WS dataset.
In addition to these datasets, Tesseract also uses a technique called adaptive training, where the engine can continuously improve its recognition capabilities by learning from new images and text. This allows Tesseract to adapt to new text styles and languages, and improve its overall accuracy.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/ocr/PDF_TABLE_RECOGNITION_HOCR/){:.button.button-orange.button-orange-trans.co.button-icon}
[Open in Colab](https://github.com/JohnSnowLabs/spark-ocr-workshop/tree/master/jupyter/SparkOCRPdfToTable.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pdf_to_hocr = PdfToHocr() \
.setInputCol("content") \
.setOutputCol("hocr")
tokenizer = HocrTokenizer() \
.setInputCol("hocr") \
.setOutputCol("token")
pdf_to_image = PdfToImage() \
.setInputCol("content") \
.setOutputCol("image") \
.setPageNumCol("tmp_pagenum") \
.setImageType(ImageType.TYPE_3BYTE_BGR)
table_detector = ImageTableDetector \
.pretrained("general_model_table_detection_v2", "en", "public/ocr/models") \
.setInputCol("image") \
.setOutputCol("table_regions") \
.setScoreThreshold(0.9) \
.setApplyCorrection(True) \
.setScaleWidthToCol("width_dimension") \
.setScaleHeightToCol("height_dimension")
image_scaler = ImageScaler() \
.setWidthCol("width_dimension") \
.setHeightCol("height_dimension")
hocr_to_table = HocrToTextTable() \
.setInputCol("hocr") \
.setRegionCol("table_regions") \
.setOutputCol("tables")
draw_annotations = ImageDrawAnnotations() \
.setInputCol("scaled_image") \
.setInputChunksCol("tables") \
.setOutputCol("image_with_annotations") \
.setFilledRect(False) \
.setFontSize(5) \
.setRectColor(Color.red)
draw_regions = ImageDrawRegions() \
.setInputCol("scaled_image") \
.setInputRegionsCol("table_regions") \
.setOutputCol("image_with_regions") \
.setRectColor(Color.red)
pipeline1 = PipelineModel(stages=[
pdf_to_hocr,
tokenizer,
pdf_to_image,
table_detector,
image_scaler,
draw_regions,
hocr_to_table
])
test_image_path = "data/pdfs/f1120.pdf"
bin_df = spark.read.format("binaryFile").load(test_image_path)
result = pipeline1.transform(bin_df).cache().drop("tmp_pagenum")
result = result.filter(result.pagenum == 1)
```
```scala
val pdf_to_hocr = new PdfToHocr()
.setInputCol("content")
.setOutputCol("hocr")
val tokenizer = new HocrTokenizer()
.setInputCol("hocr")
.setOutputCol("token")
val pdf_to_image = new PdfToImage()
.setInputCol("content")
.setOutputCol("image")
.setPageNumCol("tmp_pagenum")
.setImageType(ImageType.TYPE_3BYTE_BGR)
val table_detector = ImageTableDetector
.pretrained("general_model_table_detection_v2", "en", "public/ocr/models")
.setInputCol("image")
.setOutputCol("table_regions")
.setScoreThreshold(0.9)
.setApplyCorrection(true)
.setScaleWidthToCol("width_dimension")
.setScaleHeightToCol("height_dimension")
val image_scaler = new ImageScaler()
.setWidthCol("width_dimension")
.setHeightCol("height_dimension")
val hocr_to_table = new HocrToTextTable()
.setInputCol("hocr")
.setRegionCol("table_regions")
.setOutputCol("tables")
val draw_annotations = new ImageDrawAnnotations()
.setInputCol("scaled_image")
.setInputChunksCol("tables")
.setOutputCol("image_with_annotations")
.setFilledRect(false)
.setFontSize(5)
.setRectColor(Color.red)
val draw_regions = new ImageDrawRegions()
.setInputCol("scaled_image")
.setInputRegionsCol("table_regions")
.setOutputCol("image_with_regions")
.setRectColor(Color.red)
val pipeline1 = new Pipeline().setStages(Array(
pdf_to_hocr,
tokenizer,
pdf_to_image,
table_detector,
image_scaler,
draw_regions,
hocr_to_table))
val test_image_path = "data/pdfs/f1120.pdf"
val bin_df = spark.read.format("binaryFile").load(test_image_path)
val result = pipeline1.fit(bin_df).transform(bin_df).cache().drop("tmp_pagenum")
val filteredResult = result.filter(col("pagenum") === 1)
```
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_RoBERTa_hindi_guj_san_hi_3.4.2_3.0_1649947496602.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_RoBERTa_hindi_guj_san_hi_3.4.2_3.0_1649947496602.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_RoBERTa_hindi_guj_san","hi") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["मुझे स्पार्क एनएलपी पसंद है"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_RoBERTa_hindi_guj_san","hi")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("मुझे स्पार्क एनएलपी पसंद है").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("hi.embed.RoBERTa_hindi_guj_san").predict("""मुझे स्पार्क एनएलपी पसंद है""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_RoBERTa_hindi_guj_san|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|hi|
|Size:|252.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/surajp/RoBERTa-hindi-guj-san
- https://github.com/goru001/inltk
- https://www.kaggle.com/disisbig/hindi-wikipedia-articles-172k
- https://www.kaggle.com/disisbig/gujarati-wikipedia-articles
- https://www.kaggle.com/disisbig/sanskrit-wikipedia-articles
- https://twitter.com/parmarsuraj99
- https://www.linkedin.com/in/parmarsuraj99/
---
layout: model
title: Multilingual XLMRobertaForTokenClassification Base Cased model (from iis2009002)
author: John Snow Labs
name: xlmroberta_ner_iis2009002_base_finetuned_panx_all
date: 2022-08-13
tags: [xx, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: xx
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `iis2009002`.
## Predicted Entities
`ORG`, `LOC`, `PER`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_iis2009002_base_finetuned_panx_all_xx_4.1.0_3.0_1660428464219.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_iis2009002_base_finetuned_panx_all_xx_4.1.0_3.0_1660428464219.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_iis2009002_base_finetuned_panx_all","xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_iis2009002_base_finetuned_panx_all","xx")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_iis2009002_base_finetuned_panx_all|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|xx|
|Size:|861.7 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/iis2009002/xlm-roberta-base-finetuned-panx-all
---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC4_Modified_biobert_v1.1
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4_Modified-biobert-v1.1` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities
`Chemical`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Modified_biobert_v1.1_en_4.0.0_3.0_1657109068197.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Modified_biobert_v1.1_en_4.0.0_3.0_1657109068197.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Modified_biobert_v1.1","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Modified_biobert_v1.1","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_BC4_Modified_biobert_v1.1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|403.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ghadeermobasher/BC4_Modified-biobert-v1.1
---
layout: model
title: Financial Relation Extraction (Work Experience, Medium, Unidirectional)
author: John Snow Labs
name: finre_work_experience_md
date: 2022-11-08
tags: [work, experience, role, en, licensed]
task: Relation Extraction
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: RelationExtractionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
IMPORTANT: Don't run this model on a whole financial report. Instead:
- Split the report by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration;
- Use the `finclf_work_experience_item` Text Classifier to select only those paragraphs.
This is a `md` (medium) version of the `finre_work_experience` model, trained with more data and with **unidirectional relation extraction**, meaning that the direction of the relation now matters: it goes from the source (`chunk1`) to the target (`chunk2`).
This model allows you to analyze present and past job positions of people, extracting relations between PERSON, ORG, ROLE and DATE. It requires an NER model with the mentioned entities, such as `finner_org_per_role_date`, and can also be combined with `finassertiondl_past_roles` to detect whether the entities are mentioned as having happened in the PAST (although you can also infer that from relations such as `had_role_until`).
## Predicted Entities
`has_role`, `had_role_until`, `has_role_from`, `works_for`, `has_role_in_company`
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finre_work_experience_md_en_1.0.0_3.0_1667922980930.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finre_work_experience_md_en_1.0.0_3.0_1667922980930.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencizer = SentenceDetectorDLModel\
.pretrained("sentence_detector_dl", "en") \
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
bert_embeddings= BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
.setInputCols(["sentence", "token"])\
.setOutputCol("bert_embeddings")
ner_model = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
.setInputCols(["sentence", "token", "bert_embeddings"])\
.setOutputCol("ner_orgs")
ner_converter = NerConverter()\
.setInputCols(["sentence","token","ner_orgs"])\
.setOutputCol("ner_chunk")
pos = PerceptronModel.pretrained()\
.setInputCols(["sentence", "token"])\
.setOutputCol("pos")
dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")\
.setInputCols(["sentence", "pos", "token"])\
.setOutputCol("dependencies")
re_filter = finance.RENerChunksFilter()\
.setInputCols(["ner_chunk", "dependencies"])\
.setOutputCol("re_ner_chunk")\
.setRelationPairs(["PERSON-ROLE", "PERSON-ORG", "ORG-ROLE", "DATE-ROLE"])
reDL = finance.RelationExtractionDLModel()\
.pretrained("finre_work_experience_md", "en", "finance/models")\
.setInputCols(["re_ner_chunk", "sentence"])\
.setOutputCol("relations")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentencizer,
tokenizer,
bert_embeddings,
ner_model,
ner_converter,
pos,
dependency_parser,
re_filter,
reDL])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = "On December 15, 2021, Anirudh Devgan assumed the role of President and Chief Executive Officer of Cadence, replacing Lip-Bu Tan. Prior to his role as Chief Executive Officer, Dr. Devgan served as President of Cadence. Concurrently, Mr. Tan transitioned to the role of Executive Chair"
lmodel = LightPipeline(model)
results = lmodel.fullAnnotate(text)
rel_df = get_relations_df(results)
rel_df = rel_df[rel_df['relation']!='other']
print(rel_df.to_string(index=False))
print()
```
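The `get_relations_df` helper used above comes from the Spark NLP workshop notebooks and is not defined in this snippet. A minimal sketch, assuming each relation annotation exposes a `.result` (the relation label) and a `.metadata` dict with the usual `entity1`/`chunk1`/`confidence` keys produced by `RelationExtractionDLModel`:

```python
import pandas as pd

def get_relations_df(results, col="relations"):
    """Flatten LightPipeline fullAnnotate output into a pandas DataFrame.

    Assumes each relation annotation has `.result` (the relation label)
    and a `.metadata` dict keyed by the entity/chunk/confidence fields
    shown in the Results section below. Verify the metadata keys against
    your Spark NLP version.
    """
    keys = ["entity1", "entity1_begin", "entity1_end", "chunk1",
            "entity2", "entity2_begin", "entity2_end", "chunk2",
            "confidence"]
    rows = [
        [rel.result] + [rel.metadata.get(k) for k in keys]
        for annotations in results
        for rel in annotations[col]
    ]
    return pd.DataFrame(rows, columns=["relation"] + keys)
```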
## Results
```bash
relation entity1 entity1_begin entity1_end chunk1 entity2 entity2_begin entity2_end chunk2 confidence
has_role_from DATE 3 19 December 15, 2021 ROLE 57 65 President 0.9532135
has_role_from DATE 3 19 December 15, 2021 ROLE 71 93 Chief Executive Officer 0.91833746
has_role PERSON 22 35 Anirudh Devgan ROLE 57 65 President 0.9993814
has_role PERSON 22 35 Anirudh Devgan ROLE 71 93 Chief Executive Officer 0.9889985
works_for PERSON 22 35 Anirudh Devgan ORG 98 104 Cadence 0.9983778
has_role_in_company ROLE 57 65 President ORG 98 104 Cadence 0.9997348
has_role_in_company ROLE 71 93 Chief Executive Officer ORG 98 104 Cadence 0.99845624
has_role ROLE 150 172 Chief Executive Officer PERSON 175 184 Dr. Devgan 0.85268635
has_role_in_company ROLE 150 172 Chief Executive Officer ORG 209 215 Cadence 0.9976404
has_role PERSON 175 184 Dr. Devgan ROLE 196 204 President 0.99899226
works_for PERSON 175 184 Dr. Devgan ORG 209 215 Cadence 0.99876934
has_role_in_company ROLE 196 204 President ORG 209 215 Cadence 0.9997203
has_role PERSON 232 238 Mr. Tan ROLE 268 282 Executive Chair 0.98612714
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finre_work_experience_md|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|405.7 MB|
## References
Manual annotations on the CUAD dataset, 10-K filings, and Wikidata.
## Benchmarking
```bash
label Recall Precision F1 Support
had_role_until 1.000 1.000 1.000 117
has_role 0.998 0.995 0.997 649
has_role_from 1.000 1.000 1.000 401
has_role_in_company 0.993 0.993 0.993 268
other 0.996 0.996 0.996 235
works_for 0.994 1.000 0.997 330
Avg. 0.997 0.997 0.997 2035
Weighted-Avg. 0.997 0.997 0.997 2035
```
---
layout: model
title: Detect PHI in text (enriched-biobert)
author: John Snow Labs
name: ner_deid_enriched_biobert
date: 2021-04-01
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Detect sensitive information in text for de-identification using a pretrained NER model.
We adhered to the official annotation guidelines (AG) of the 2014 i2b2 de-identification challenge while annotating new datasets for this model. All the details regarding the nuances of and explanations for the AG can be found here: [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/)
## Predicted Entities
`DOCTOR`, `PHONE`, `COUNTRY`, `MEDICALRECORD`, `STREET`, `CITY`, `PROFESSION`, `PATIENT`, `IDNUM`, `BIOID`, `HEALTHPLAN`, `HOSPITAL`, `USERNAME`, `LOCATION-OTHER`, `AGE`, `FAX`, `EMAIL`, `DATE`, `STATE`, `ZIP`, `URL`, `ORGANIZATION`, `DEVICE`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_biobert_en_3.0.0_3.0_1617260810027.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_biobert_en_3.0.0_3.0_1617260810027.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_deid_enriched_biobert", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings_clinical,
clinical_ner,
ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_deid_enriched_biobert", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
embeddings_clinical,
ner,
ner_converter))
val data = Seq("EXAMPLE_TEXT").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.deid.enriched_biobert").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_enriched_biobert|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
---
layout: model
title: English asr_Central_kurdish_xlsr TFWav2Vec2ForCTC from Akashpb13
author: John Snow Labs
name: asr_Central_kurdish_xlsr
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Central_kurdish_xlsr` is an English model originally trained by Akashpb13.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_Central_kurdish_xlsr_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Central_kurdish_xlsr_en_4.2.0_3.0_1664103765643.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Central_kurdish_xlsr_en_4.2.0_3.0_1664103765643.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_Central_kurdish_xlsr", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_Central_kurdish_xlsr", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_Central_kurdish_xlsr|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Sentence Entity Resolver for RxNorm (Action / Treatment)
author: John Snow Labs
name: sbiobertresolve_rxnorm_action_treatment
date: 2022-04-25
tags: [licensed, en, entity_resolution, clinical, rxnorm]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.5.1
spark_version: 2.4
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. Additionally, this model returns the actions and treatments of the drugs in the `all_k_aux_labels` column.
## Predicted Entities
`RxNorm Codes`, `Action`, `Treatment`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_action_treatment_en_3.5.1_2.4_1650899853599.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_action_treatment_en_3.5.1_2.4_1650899853599.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("sbert_embeddings")
rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_action_treatment", "en", "clinical/models")\
.setInputCols(["sbert_embeddings"])\
.setOutputCol("rxnorm_code")\
.setDistanceFunction("EUCLIDEAN")
pipelineModel = PipelineModel(stages=[documentAssembler, sbert_embedder, rxnorm_resolver])
light_model = LightPipeline(pipelineModel)
result = light_model.fullAnnotate(["Zita 200 mg", "coumadin 5 mg", "avandia 4 mg"])
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en","clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("sbert_embeddings")
val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_action_treatment", "en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("rxnorm_code")
.setDistanceFunction("EUCLIDEAN")
val rxnorm_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, rxnorm_resolver))
val rxnorm_pipelineModel = rxnorm_pipeline.fit(Seq("").toDF("text"))
val light_model = new LightPipeline(rxnorm_pipelineModel)
val result = light_model.fullAnnotate(Array("Zita 200 mg", "coumadin 5 mg", "avandia 4 mg"))
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.rxnorm_action_treatment").predict("""coumadin 5 mg""")
```
## Results
```bash
| | ner_chunk | rxnorm_code | action | treatment |
|---:|:--------------|--------------:|:---------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 | Zita 200 mg | 104080 | ['Analgesic', 'Antacid', 'Antipyretic', 'Pain Reliever'] | ['Backache', 'Pain', 'Sore Throat', 'Headache', 'Influenza', 'Toothache', 'Heartburn', 'Migraine', 'Muscular Aches And Pains', 'Neuralgia', 'Cold', 'Weakness'] |
| 1 | coumadin 5 mg | 855333 | ['Anticoagulant'] | ['Cerebrovascular Accident', 'Pulmonary Embolism', 'Heart Attack', 'AF', 'Embolization'] |
| 2 | avandia 4 mg | 261242 | ['Drugs Used In Diabets', 'Hypoglycemic'] | ['Diabetes Mellitus', 'Type 1 Diabetes Mellitus', 'Type 2 Diabetes'] |
```
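The actions and treatments shown above are packed into the resolver annotation's `all_k_aux_labels` metadata entry. A minimal sketch of pulling them out of a `fullAnnotate` result, assuming the `:::`-delimited packing (one aux label per candidate code) that Spark NLP resolvers usually use; verify the delimiter against your version:

```python
def get_aux_labels(annotations, col="rxnorm_code"):
    """Split the `all_k_aux_labels` metadata string of each resolver
    annotation into a list with one aux label per candidate code.

    The ':::' delimiter is an assumption about how the resolver packs
    per-candidate metadata, not a documented guarantee.
    """
    return [
        ann.metadata.get("all_k_aux_labels", "").split(":::")
        for ann in annotations[col]
    ]
```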
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_rxnorm_action_treatment|
|Compatibility:|Healthcare NLP 3.5.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[rxnorm_code]|
|Language:|en|
|Size:|918.7 MB|
|Case sensitive:|false|
---
layout: model
title: English asr_wav2vec2_large_robust_swbd_300h TFWav2Vec2ForCTC from facebook
author: John Snow Labs
name: asr_wav2vec2_large_robust_swbd_300h
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_robust_swbd_300h` is an English model originally trained by facebook.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_robust_swbd_300h_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_robust_swbd_300h_en_4.2.0_3.0_1664038284772.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_robust_swbd_300h_en_4.2.0_3.0_1664038284772.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_robust_swbd_300h", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_robust_swbd_300h", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_robust_swbd_300h|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|757.5 MB|
---
layout: model
title: English BertForQuestionAnswering model (from maroo93)
author: John Snow Labs
name: bert_qa_squad2.0
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad2.0` is an English model originally trained by `maroo93`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_squad2.0_en_4.0.0_3.0_1654192176164.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_squad2.0_en_4.0.0_3.0_1654192176164.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad2.0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_squad2.0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/maroo93/squad2.0
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from vkrishnamoorthy)
author: John Snow Labs
name: distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `vkrishnamoorthy`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773167222.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773167222.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/vkrishnamoorthy/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Named Entity Recognition for Japanese (BERT Base Japanese)
author: John Snow Labs
name: ner_ud_gsd_bert_base_japanese
date: 2021-09-16
tags: [ja, ner, open_source]
task: Named Entity Recognition
language: ja
edition: Spark NLP 3.2.2
spark_version: 3.0
supported: true
annotator: NerDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model annotates named entities in a text, which can be used to find features such as the names of people, places, and organizations. The model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together.
This model uses the pretrained BertEmbeddings embeddings "bert_base_ja" as an input, so be sure to use the same embeddings in the pipeline.
## Predicted Entities
`ORDINAL`, `PERSON`, `LAW`, `MOVEMENT`, `LOC`, `WORK_OF_ART`, `DATE`, `NORP`, `TITLE_AFFIX`, `QUANTITY`, `FAC`, `TIME`, `MONEY`, `LANGUAGE`, `GPE`, `EVENT`, `ORG`, `PERCENT`, `PRODUCT`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_ud_gsd_bert_base_japanese_ja_3.2.2_3.0_1631804789491.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_ud_gsd_bert_base_japanese_ja_3.2.2_3.0_1631804789491.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.training import *
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_base_japanese", "ja") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nerTagger = NerDLModel.pretrained("ner_ud_gsd_bert_base_japanese", "ja") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
pipeline = Pipeline().setStages(
[
documentAssembler,
sentence,
word_segmenter,
embeddings,
nerTagger,
]
)
data = spark.createDataFrame([["宮本茂氏は、日本の任天堂のゲームプロデューサーです。"]]).toDF("text")
model = pipeline.fit(data)
result = model.transform(data)
result.selectExpr("explode(arrays_zip(token.result, ner.result))").show()
```
```scala
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{SentenceDetector, WordSegmenterModel}
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja")
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_base_japanese", "ja")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val nerTagger = NerDLModel.pretrained("ner_ud_gsd_bert_base_japanese", "ja")
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
word_segmenter,
embeddings,
nerTagger
))
val data = Seq("宮本茂氏は、日本の任天堂のゲームプロデューサーです。").toDF("text")
val model = pipeline.fit(data)
val result = model.transform(data)
result.selectExpr("explode(arrays_zip(token.result, ner.result))").show()
```
## Results
```bash
# +-------------------+
# | col|
# +-------------------+
# | {宮本, B-PERSON}|
# | {茂, I-PERSON}|
# | {氏, O}|
# | {は, O}|
# | {、, O}|
# | {日本, B-GPE}|
# | {の, O}|
# | {任天, B-ORG}|
# | {堂, I-ORG}|
# | {の, O}|
# | {ゲーム, O}|
# |{プロデューサー, O}|
# | {です, O}|
# | {。, O}|
# +-------------------+
```
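The pipeline above has no `NerConverter` stage, so the results are token-level BIO tags. To merge them into entity chunks you can append a `NerConverter` to the pipeline, or post-process the (token, tag) pairs directly; a minimal sketch, joining tokens without spaces as suits Japanese text:

```python
def bio_to_chunks(pairs):
    """Merge (token, BIO-tag) pairs into (chunk, entity_label) tuples.

    Tokens inside one entity are concatenated without spaces, which is
    appropriate for Japanese; adapt the join for whitespace languages.
    """
    chunks, cur_toks, cur_label = [], [], None
    for token, tag in pairs:
        if tag.startswith("B-"):
            if cur_toks:  # close the previous entity
                chunks.append(("".join(cur_toks), cur_label))
            cur_toks, cur_label = [token], tag[2:]
        elif tag.startswith("I-") and cur_toks and tag[2:] == cur_label:
            cur_toks.append(token)  # continue the current entity
        else:  # "O" tag or inconsistent I- tag: flush and reset
            if cur_toks:
                chunks.append(("".join(cur_toks), cur_label))
            cur_toks, cur_label = [], None
    if cur_toks:
        chunks.append(("".join(cur_toks), cur_label))
    return chunks
```

Applied to the tagged tokens above, this yields chunks such as (宮本茂, PERSON), (日本, GPE) and (任天堂, ORG).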
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_ud_gsd_bert_base_japanese|
|Type:|ner|
|Compatibility:|Spark NLP 3.2.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ja|
|Dependencies:|bert_base_ja|
## Data Source
The model was trained on the Universal Dependencies Japanese GSD treebank, curated by Google. An NER version was created by megagonlabs:
https://github.com/megagonlabs/UD_Japanese-GSD
Reference:
Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC-2018.
## Benchmarking
```bash
label precision recall f1-score support
CARDINAL 0.00 0.00 0.00 0
DATE 0.95 0.96 0.96 206
EVENT 0.84 0.50 0.63 52
FAC 0.75 0.71 0.73 59
GPE 0.79 0.76 0.78 102
LANGUAGE 1.00 1.00 1.00 8
LAW 1.00 0.31 0.47 13
LOC 0.89 0.83 0.86 41
MONEY 1.00 1.00 1.00 20
MOVEMENT 1.00 0.18 0.31 11
NORP 0.85 0.82 0.84 57
O 0.99 0.99 0.99 11785
ORDINAL 0.81 0.94 0.87 32
ORG 0.78 0.65 0.71 179
PERCENT 0.89 1.00 0.94 16
PERSON 0.76 0.84 0.80 127
PRODUCT 0.62 0.68 0.65 50
QUANTITY 0.92 0.94 0.93 172
TIME 0.97 0.88 0.92 32
TITLE_AFFIX 0.89 0.71 0.79 24
WORK_OF_ART 0.66 0.73 0.69 48
accuracy - - 0.97 13034
macro-avg 0.83 0.73 0.75 13034
weighted-avg 0.97 0.97 0.97 13034
```
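The weighted-average F1 in the benchmark above is the support-weighted mean of the per-label F1 scores. A quick sanity check over the table's rows reproduces the reported 0.97:

```python
# (label, f1, support) rows copied from the benchmark table above
rows = [
    ("CARDINAL", 0.00, 0), ("DATE", 0.96, 206), ("EVENT", 0.63, 52),
    ("FAC", 0.73, 59), ("GPE", 0.78, 102), ("LANGUAGE", 1.00, 8),
    ("LAW", 0.47, 13), ("LOC", 0.86, 41), ("MONEY", 1.00, 20),
    ("MOVEMENT", 0.31, 11), ("NORP", 0.84, 57), ("O", 0.99, 11785),
    ("ORDINAL", 0.87, 32), ("ORG", 0.71, 179), ("PERCENT", 0.94, 16),
    ("PERSON", 0.80, 127), ("PRODUCT", 0.65, 50), ("QUANTITY", 0.93, 172),
    ("TIME", 0.92, 32), ("TITLE_AFFIX", 0.79, 24), ("WORK_OF_ART", 0.69, 48),
]
total = sum(s for _, _, s in rows)           # 13034 tokens, matching the table
weighted_f1 = sum(f * s for _, f, s in rows) / total
print(total, round(weighted_f1, 2))          # 13034 0.97
```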
---
layout: model
title: English BertForTokenClassification Small Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC4_modified_PubmedBert_small
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4-modified-PubmedBert_small` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities
`Chemical`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_modified_PubmedBert_small_en_4.0.0_3.0_1657108180187.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_modified_PubmedBert_small_en_4.0.0_3.0_1657108180187.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_modified_PubmedBert_small","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_modified_PubmedBert_small","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_BC4_modified_PubmedBert_small|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|408.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ghadeermobasher/BC4-modified-PubmedBert_small
---
layout: model
title: Pipeline to Detect Disease Mentions (MedicalBertForTokenClassification) (BERT)
author: John Snow Labs
name: bert_token_classifier_disease_mentions_tweet_pipeline
date: 2023-03-20
tags: [es, clinical, licensed, public_health, ner, token_classification, disease, tweet]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [bert_token_classifier_disease_mentions_tweet](https://nlp.johnsnowlabs.com/2022/07/28/bert_token_classifier_disease_mentions_tweet_es_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_disease_mentions_tweet_pipeline_es_4.3.0_3.2_1679299531828.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_disease_mentions_tweet_pipeline_es_4.3.0_3.2_1679299531828.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_disease_mentions_tweet_pipeline", "es", "clinical/models")
text = '''El diagnóstico fueron varios. Principal: Neumonía en el pulmón derecho. Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada. Gripe No. Estuvo hablando conmigo, sin exagerar, mas de media hora, dándome ánimo y fuerza y que sabe, porque ha visto.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_disease_mentions_tweet_pipeline", "es", "clinical/models")
val text = "El diagnóstico fueron varios. Principal: Neumonía en el pulmón derecho. Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada. Gripe No. Estuvo hablando conmigo, sin exagerar, mas de media hora, dándome ánimo y fuerza y que sabe, porque ha visto."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:----------------------|--------:|------:|:------------|-------------:|
| 0 | Neumonía en el pulmón | 41 | 61 | ENFERMEDAD | 0.999969 |
| 1 | Sinusitis | 72 | 80 | ENFERMEDAD | 0.999977 |
| 2 | Faringitis aguda | 94 | 109 | ENFERMEDAD | 0.999969 |
| 3 | infección de orina | 113 | 130 | ENFERMEDAD | 0.999969 |
| 4 | Gripe | 150 | 154 | ENFERMEDAD | 0.999983 |
```
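The `begin`/`end` columns in the results above are inclusive, 0-based character offsets into the input text, as used by Spark NLP annotations; a chunk therefore corresponds to `text[begin:end + 1]`. A short check against the example sentence:

```python
text = ("El diagnóstico fueron varios. Principal: Neumonía en el pulmón derecho. "
        "Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada. "
        "Gripe No.")

# Offsets and labels copied from the result table above; end is inclusive,
# so slicing uses end + 1.
for begin, end, label in [(41, 61, "ENFERMEDAD"), (72, 80, "ENFERMEDAD"),
                          (150, 154, "ENFERMEDAD")]:
    print(text[begin:end + 1], label)
# Neumonía en el pulmón ENFERMEDAD
# Sinusitis ENFERMEDAD
# Gripe ENFERMEDAD
```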
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_disease_mentions_tweet_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|es|
|Size:|462.2 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverterInternalModel
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from google)
author: John Snow Labs
name: t5_efficient_base_nl16
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-nl16` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nl16_en_4.3.0_3.0_1675113699074.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nl16_en_4.3.0_3.0_1675113699074.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_base_nl16","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_base_nl16","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_base_nl16|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|602.0 MB|
## References
- https://huggingface.co/google/t5-efficient-base-nl16
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English asr_wav2vec2_base_100h_by_vuiseng9 TFWav2Vec2ForCTC from vuiseng9
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_100h_by_vuiseng9
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_by_vuiseng9` is an English model originally trained by vuiseng9.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_100h_by_vuiseng9_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_by_vuiseng9_en_4.2.0_3.0_1664022865816.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_by_vuiseng9_en_4.2.0_3.0_1664022865816.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_100h_by_vuiseng9', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_100h_by_vuiseng9", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_100h_by_vuiseng9|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|227.9 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English image_classifier_vit_taco_or_what ViTForImageClassification from osanseviero
author: John Snow Labs
name: image_classifier_vit_taco_or_what
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_taco_or_what` is an English model originally trained by osanseviero.
## Predicted Entities
`burrito`, `taco`, `quesadilla`, `fajitas`, `kebab`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_taco_or_what_en_4.1.0_3.0_1660169560946.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_taco_or_what_en_4.1.0_3.0_1660169560946.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_taco_or_what", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_taco_or_what", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_taco_or_what|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Sentence Embeddings - sbert medium (tuned)
author: John Snow Labs
name: sbert_jsl_medium_umls_uncased
date: 2021-05-14
tags: [embeddings, clinical, licensed, en]
task: Embeddings
language: en
nav_key: models
edition: Healthcare NLP 3.0.3
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained to generate contextual sentence embeddings of input sentences.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_umls_uncased_en_3.0.3_2.4_1621017148548.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_umls_uncased_en_3.0.3_2.4_1621017148548.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sbiobert_embeddings = BertSentenceEmbeddings\
.pretrained("sbert_jsl_medium_umls_uncased","en","clinical/models")\
.setInputCols(["sentence"])\
.setOutputCol("sbert_embeddings")
```
```scala
val sbiobert_embeddings = BertSentenceEmbeddings
.pretrained("sbert_jsl_medium_umls_uncased","en","clinical/models")
.setInputCols("sentence")
.setOutputCol("sbert_embeddings")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed_sentence.bert.jsl_medium_umls_uncased").predict("""Put your text here.""")
```
## Results
```bash
Gives a 768-dimensional vector representation of the sentence.
```
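A common downstream use of these sentence vectors (for example, the STS(cos) score in the benchmark below) is cosine similarity. This is a minimal sketch with toy 4-dimensional stand-ins for the model's 768-dimensional embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical toy vectors; real sbert_embeddings entries have 768 components.
v1 = [0.1, 0.3, -0.2, 0.5]
v2 = [0.1, 0.3, -0.2, 0.5]
v3 = [-0.5, 0.2, 0.3, -0.1]
print(round(cosine(v1, v2), 3))  # 1.0 — identical sentences
print(cosine(v1, v3) < 0.5)      # True — dissimilar sentences
```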
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbert_jsl_medium_umls_uncased|
|Compatibility:|Healthcare NLP 3.0.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Case sensitive:|false|
## Data Source
Tuned on the MedNLI and UMLS datasets.
## Benchmarking
```bash
MedNLI Score
Acc 0.744
STS(cos) 0.695
```
---
layout: model
title: Multilingual T5ForConditionalGeneration Base Cased model (from KETI-AIR)
author: John Snow Labs
name: t5_ke_base
date: 2023-01-30
tags: [en, ko, open_source, t5, xx, tensorflow]
task: Text Generation
language: xx
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ke-t5-base` is a Multilingual model originally trained by `KETI-AIR`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_ke_base_xx_4.3.0_3.0_1675104312892.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_ke_base_xx_4.3.0_3.0_1675104312892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_ke_base","xx") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_ke_base","xx")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_ke_base|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|xx|
|Size:|569.3 MB|
## References
- https://huggingface.co/KETI-AIR/ke-t5-base
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints
- https://github.com/AIRC-KETI/ke-t5
- https://aclanthology.org/2021.findings-emnlp.33/
- https://jmlr.org/papers/volume21/20-074/20-074.pdf
- https://aclanthology.org/2021.acl-long.330.pdf
- https://dl.acm.org/doi/pdf/10.1145/3442188.3445922
- https://www.tensorflow.org/datasets/catalog/c4
- https://mlco2.github.io/impact#compute
- https://arxiv.org/abs/1910.09700
- https://colab.research.google.com/github/google-research/text-to-text-transfer-transformer/blob/main/notebooks/t5-trivia.ipynb
---
layout: model
title: Translate Portuguese-based languages to English Pipeline
author: John Snow Labs
name: translate_cpp_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, cpp, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `cpp`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_cpp_en_xx_2.7.0_2.4_1609687142938.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_cpp_en_xx_2.7.0_2.4_1609687142938.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_cpp_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_cpp_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.cpp.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_cpp_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from mcurmei)
author: John Snow Labs
name: distilbert_qa_single_label_n_max
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `single_label_N_max` is an English model originally trained by `mcurmei`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_single_label_n_max_en_4.3.0_3.0_1672775498618.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_single_label_n_max_en_4.3.0_3.0_1672775498618.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_single_label_n_max","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_single_label_n_max","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_single_label_n_max|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/mcurmei/single_label_N_max
---
layout: model
title: Japanese Electra Embeddings (from izumi-lab)
author: John Snow Labs
name: electra_embeddings_electra_small_japanese_fin_generator
date: 2022-05-17
tags: [ja, open_source, electra, embeddings]
task: Embeddings
language: ja
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-small-japanese-fin-generator` is a Japanese model originally trained by `izumi-lab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_small_japanese_fin_generator_ja_3.4.4_3.0_1652786680826.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_small_japanese_fin_generator_ja_3.4.4_3.0_1652786680826.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_small_japanese_fin_generator","ja") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Spark NLPが大好きです"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_small_japanese_fin_generator","ja")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Spark NLPが大好きです").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_embeddings_electra_small_japanese_fin_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|ja|
|Size:|52.6 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/izumi-lab/electra-small-japanese-fin-generator
- https://github.com/google-research/electra
- https://github.com/retarfi/language-pretraining/tree/v1.0
- https://arxiv.org/abs/2003.10555
- https://creativecommons.org/licenses/by-sa/4.0/
---
layout: model
title: English ElectraForQuestionAnswering model (from ptran74) Version-5
author: John Snow Labs
name: electra_qa_DSPFirst_Finetuning_5
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `DSPFirst-Finetuning-5` is an English model originally trained by `ptran74`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_DSPFirst_Finetuning_5_en_4.0.0_3.0_1655919805104.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_DSPFirst_Finetuning_5_en_4.0.0_3.0_1655919805104.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_DSPFirst_Finetuning_5","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_DSPFirst_Finetuning_5","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq("What is my name?", "My name is Clara and I live in Berkeley.").toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.electra.finetuning_5").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_DSPFirst_Finetuning_5|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ptran74/DSPFirst-Finetuning-5
- https://github.gatech.edu/pages/VIP-ITS/textbook_SQuAD_explore/explore/textbookv1.0/textbook/
- https://github.com/patil-suraj/question_generation
---
layout: model
title: Wolof XLMRobertaForTokenClassification Base Cased model (from mbeukman)
author: John Snow Labs
name: xlmroberta_ner_base_finetuned_wolof_finetuned_ner_wolof
date: 2022-08-01
tags: [wo, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: wo
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-wolof-finetuned-ner-wolof` is a Wolof model originally trained by `mbeukman`.
## Predicted Entities
`DATE`, `PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_wolof_finetuned_ner_wolof_wo_4.1.0_3.0_1659356140998.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_wolof_finetuned_ner_wolof_wo_4.1.0_3.0_1659356140998.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_wolof_finetuned_ner_wolof","wo") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_wolof_finetuned_ner_wolof","wo")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_finetuned_wolof_finetuned_ner_wolof|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|wo|
|Size:|1.0 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-wolof-finetuned-ner-wolof
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://github.com/masakhane-io/masakhane-ner
---
layout: model
title: Legal Guarantee Clause Binary Classifier
author: John Snow Labs
name: legclf_guarantee_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `guarantee` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available on Models Hub, producing a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `guarantee`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_guarantee_clause_en_1.0.0_3.2_1660123571558.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_guarantee_clause_en_1.0.0_3.2_1660123571558.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
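The code block for this card is missing. Below is a minimal sketch of how a sentence-embeddings clause classifier like this one is typically assembled in Spark NLP; the embeddings stage (`sent_bert_base_cased`), the generic `ClassifierDLModel` class, and the input column name are assumptions, not taken from this card — check the model's training details for the exact embeddings to pair with it:

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

# Hypothetical embeddings stage: the classifier expects a "sentence_embeddings" input column.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_guarantee_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["PUT YOUR CLAUSE TEXT HERE"]]).toDF("clause_text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```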
## Results
```bash
+-----------+
|     result|
+-----------+
|[guarantee]|
|    [other]|
|    [other]|
|[guarantee]|
+-----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_guarantee_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.2 MB|
## References
Legal documents, scraped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
guarantee 0.88 0.80 0.83 88
other 0.91 0.95 0.93 192
accuracy - - 0.90 280
macro-avg 0.89 0.87 0.88 280
weighted-avg 0.90 0.90 0.90 280
```
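The aggregate rows in the table above follow directly from the per-class scores. As a quick plain-Python sanity check (the table values are rounded to two decimals, so recomputed aggregates only agree to about ±0.01):

```python
# Per-class scores from the benchmarking table above.
classes = {
    "guarantee": {"precision": 0.88, "recall": 0.80, "support": 88},
    "other":     {"precision": 0.91, "recall": 0.95, "support": 192},
}

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

total = sum(c["support"] for c in classes.values())

# Macro average: unweighted mean over classes.
macro_precision = sum(c["precision"] for c in classes.values()) / len(classes)

# Weighted average: mean weighted by class support.
weighted_precision = sum(c["precision"] * c["support"] for c in classes.values()) / total

print(round(macro_precision, 2), round(weighted_precision, 2), round(f1(0.88, 0.80), 2))
```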
---
layout: model
title: English ElectraForQuestionAnswering Small model (from mrm8488)
author: John Snow Labs
name: electra_qa_small_finetuned_squadv1
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-small-finetuned-squadv1` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_small_finetuned_squadv1_en_4.0.0_3.0_1655921278345.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_small_finetuned_squadv1_en_4.0.0_3.0_1655921278345.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_small_finetuned_squadv1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_small_finetuned_squadv1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.electra.small").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_small_finetuned_squadv1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|51.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/mrm8488/electra-small-finetuned-squadv1
- https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/
---
layout: model
title: Legal Negative covenants Clause Binary Classifier
author: John Snow Labs
name: legclf_negative_covenants_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `negative-covenants` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available on Models Hub, producing a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `negative-covenants`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_negative_covenants_clause_en_1.0.0_3.2_1660122676325.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_negative_covenants_clause_en_1.0.0_3.2_1660122676325.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
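No pipeline code is given on this card. A minimal sketch of the usual assembly for this kind of sentence-embeddings clause classifier follows; the embeddings model name (`sent_bert_base_cased`), the generic `ClassifierDLModel` class, and the input column are assumptions for illustration — use the embeddings this model was actually trained with:

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

# Hypothetical embeddings stage feeding the "sentence_embeddings" column the model expects.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_negative_covenants_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["PUT YOUR CLAUSE TEXT HERE"]]).toDF("clause_text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```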
## Results
```bash
+--------------------+
|              result|
+--------------------+
|[negative-covenants]|
|             [other]|
|             [other]|
|[negative-covenants]|
+--------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_negative_covenants_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
negative-covenants 1.00 0.92 0.96 51
other 0.97 1.00 0.98 130
accuracy - - 0.98 181
macro-avg 0.99 0.96 0.97 181
weighted-avg 0.98 0.98 0.98 181
```
---
layout: model
title: Detect Organism in Medical Text
author: John Snow Labs
name: bert_token_classifier_ner_species
date: 2022-07-25
tags: [en, ner, clinical, licensed, bertfortokenclassification]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalBertForTokenClassifier
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Species-800 is a corpus for species entities based on manually annotated abstracts: it comprises 800 PubMed abstracts that contain identified organism mentions.
This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP.
## Predicted Entities
`SPECIES`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_species_en_4.0.0_3.0_1658758056681.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_species_en_4.0.0_3.0_1658758056681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_species", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("ner")\
.setCaseSensitive(True)\
.setMaxSentenceLength(512)
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
ner_model,
ner_converter
])
data = spark.createDataFrame([["""As determined by 16S rRNA gene sequence analysis, strain 6C (T) represents a distinct species belonging to the class Betaproteobacteria and is most closely related to Thiomonas intermedia DSM 18155 (T) and Thiomonas perometabolis DSM 18570 (T) ."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_species", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
ner_model,
ner_converter))
val data = Seq("""As determined by 16S rRNA gene sequence analysis, strain 6C (T) represents a distinct species belonging to the class Betaproteobacteria and is most closely related to Thiomonas intermedia DSM 18155 (T) and Thiomonas perometabolis DSM 18570 (T) .""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.species").predict("""As determined by 16S rRNA gene sequence analysis, strain 6C (T) represents a distinct species belonging to the class Betaproteobacteria and is most closely related to Thiomonas intermedia DSM 18155 (T) and Thiomonas perometabolis DSM 18570 (T) .""")
```
## Results
```bash
+-----------------------+-------+
|ner_chunk |label |
+-----------------------+-------+
|6C (T) |SPECIES|
|Betaproteobacteria |SPECIES|
|Thiomonas intermedia |SPECIES|
|DSM 18155 (T) |SPECIES|
|Thiomonas perometabolis|SPECIES|
|DSM 18570 (T) |SPECIES|
+-----------------------+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_species|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
[https://species.jensenlab.org/](https://species.jensenlab.org/)
## Benchmarking
```bash
label precision recall f1-score support
B-SPECIES 0.6073 0.9374 0.7371 767
I-SPECIES 0.7418 0.8648 0.7986 1043
micro-avg 0.6754 0.8956 0.7701 1810
macro-avg 0.6745 0.9011 0.7678 1810
weighted-avg 0.6848 0.8956 0.7725 1810
```
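Because support equals TP + FN for each label, the micro-averaged recall in the table above is exactly the support-weighted mean of the per-class recalls. A quick plain-Python check against the rounded values:

```python
# Per-label recall and support from the benchmarking table above.
rows = {
    "B-SPECIES": {"recall": 0.9374, "support": 767},
    "I-SPECIES": {"recall": 0.8648, "support": 1043},
}

total_support = sum(r["support"] for r in rows.values())

# Support-weighted recall; identical to micro-averaged recall, since
# micro recall = sum(TP) / sum(TP + FN) = sum(recall_i * support_i) / sum(support_i).
weighted_recall = sum(r["recall"] * r["support"] for r in rows.values()) / total_support

print(round(weighted_recall, 4))  # → 0.8956, matching both aggregate recall rows in the table
```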
---
layout: model
title: Sentiment Analysis pipeline for English (analyze_sentimentdl_glove_imdb)
author: John Snow Labs
name: analyze_sentimentdl_glove_imdb
date: 2021-03-24
tags: [open_source, english, analyze_sentimentdl_glove_imdb, pipeline, en]
supported: true
task: Sentiment Analysis
language: en
nav_key: models
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The analyze_sentimentdl_glove_imdb is a pretrained pipeline that performs the common text processing steps (sentence detection, tokenization, GloVe word embeddings and sentence embeddings) and then assigns a sentiment label to the input text with a SentimentDL classifier trained on the IMDB movie-review dataset.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/analyze_sentimentdl_glove_imdb_en_3.0.0_3.0_1616544505213.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/analyze_sentimentdl_glove_imdb_en_3.0.0_3.0_1616544505213.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('analyze_sentimentdl_glove_imdb', lang = 'en')
annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("analyze_sentimentdl_glove_imdb", lang = "en")
val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs ! "]
result_df = nlu.load('en.sentiment.glove').predict(text)
result_df
```
## Results
```bash
| | document | sentence | tokens | word_embeddings | sentence_embeddings | sentiment |
|---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:-----------------------------|:------------|
| 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[0.2668800055980682,.,...]] | [[0.0771183446049690,.,...]] | ['neg'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|analyze_sentimentdl_glove_imdb|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
---
layout: model
title: Extract Pharmacological Entities From Spanish Medical Texts (BertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_pharmacology
date: 2022-08-11
tags: [es, clinical, licensed, token_classification, bert, ner, pharmacology]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 4.0.2
spark_version: 3.0
supported: true
annotator: MedicalBertForTokenClassifier
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Named Entity Recognition model is intended for detecting pharmacological entities from Spanish medical texts and trained using the BertForTokenClassification method from the transformers library and [BERT based](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) embeddings.
The model detects PROTEINAS and NORMALIZABLES.
## Predicted Entities
`PROTEINAS`, `NORMALIZABLES`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_pharmacology_es_4.0.2_3.0_1660236427687.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_pharmacology_es_4.0.2_3.0_1660236427687.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_pharmacology", "es", "clinical/models")\
.setInputCols("token", "sentence")\
.setOutputCol("label")\
.setCaseSensitive(True)
ner_converter = NerConverter()\
.setInputCols(["sentence","token","label"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
tokenClassifier,
ner_converter])
data = spark.createDataFrame([["""Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa)."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_pharmacology", "es", "clinical/models")
.setInputCols(Array("token", "sentence"))
.setOutputCol("label")
.setCaseSensitive(true)
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","label"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
tokenClassifier,
ner_converter))
val data = Seq("""Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa).""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.classify.bert_token.pharmacology").predict("""Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa).""")
```
## Results
```bash
+---------------+-------------+
|chunk |ner_label |
+---------------+-------------+
|creatinkinasa |PROTEINAS |
|LDH |PROTEINAS |
|urea |NORMALIZABLES|
|CA 19.9 |PROTEINAS |
|vimentina |PROTEINAS |
|S-100 |PROTEINAS |
|HMB-45 |PROTEINAS |
|actina |PROTEINAS |
|Cisplatino |NORMALIZABLES|
|Interleukina II|PROTEINAS |
|Dacarbacina |NORMALIZABLES|
|Interferon alfa|PROTEINAS |
+---------------+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_pharmacology|
|Compatibility:|Healthcare NLP 4.0.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|410.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## Benchmarking
```bash
label precision recall f1-score support
B-NORMALIZABLES 0.9458 0.9694 0.9575 3076
I-NORMALIZABLES 0.8788 0.8969 0.8878 291
B-PROTEINAS 0.9164 0.9369 0.9265 2234
I-PROTEINAS 0.8825 0.7634 0.8186 748
micro-avg 0.9257 0.9304 0.9280 6349
macro-avg 0.9059 0.8917 0.8976 6349
weighted-avg 0.9249 0.9304 0.9270 6349
```
---
layout: model
title: Translate Manx to English Pipeline
author: John Snow Labs
name: translate_gv_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, gv, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `gv`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_gv_en_xx_2.7.0_2.4_1609686733139.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_gv_en_xx_2.7.0_2.4_1609686733139.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_gv_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_gv_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.gv.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_gv_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Death Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_death_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, death, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Death` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available on Models Hub, producing a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Death`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_death_bert_en_1.0.0_3.0_1678049968182.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_death_bert_en_1.0.0_3.0_1678049968182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
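This card omits the usage code. As a minimal sketch (not taken from this card), a clause classifier consuming sentence embeddings is typically wired up as below; the embeddings name `sent_bert_base_cased` and the generic `ClassifierDLModel` class are assumptions — pair the model with the embeddings it was trained on:

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

# Hypothetical embeddings stage producing the "sentence_embeddings" column this model consumes.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_death_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["PUT YOUR CLAUSE TEXT HERE"]]).toDF("clause_text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```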
## Results
```bash
+-------+
| result|
+-------+
|[Death]|
|[Other]|
|[Other]|
|[Death]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_death_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Death 0.86 1.00 0.93 31
Other 1.00 0.90 0.95 49
accuracy - - 0.94 80
macro-avg 0.93 0.95 0.94 80
weighted-avg 0.95 0.94 0.94 80
```
---
layout: model
title: English T5ForConditionalGeneration Cased model (from KES)
author: John Snow Labs
name: t5_kes
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `T5-KES` is an English model originally trained by `KES`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_kes_en_4.3.0_3.0_1675099343508.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_kes_en_4.3.0_3.0_1675099343508.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_kes","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_kes","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_kes|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|912.8 MB|
## References
- https://huggingface.co/KES/T5-KES
- https://arxiv.org/abs/1702.04066
- https://github.com/EricFillion/happy-transformer
- https://pypi.org/project/Caribe/
---
layout: model
title: Legal Sanctions Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_sanctions_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, sanctions, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
LEDGAR is a dataset for contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Sanctions` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them, unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
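The first splitting technique listed above, paragraph splitting by multiline, can be sketched in plain Python. This is a rough stand-in only: the whitespace token count is an approximation of the model's real tokenizer, used here just to flag chunks over the 512-token budget.

```python
# Rough sketch of "paragraph splitting (by multiline)" in plain Python.
# Whitespace tokens only approximate the model tokenizer's count.
def split_paragraphs(text, max_tokens=512):
    """Split on blank lines; return (chunk, token_count, fits) tuples."""
    chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [(c, len(c.split()), len(c.split()) <= max_tokens) for c in chunks]

sample = "First clause text here.\n\nSecond clause text."
for chunk, n_tokens, fits in split_paragraphs(sample):
    print(n_tokens, fits)
```

Chunks flagged `False` should be split further before being sent to the classifier.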
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, giving as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Sanctions`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_sanctions_bert_en_1.0.0_3.0_1678050581574.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_sanctions_bert_en_1.0.0_3.0_1678050581574.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
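A minimal usage sketch, modeled on the other licensed classifier pipelines in this document; the `sent_bert_base_cased` sentence embeddings and the `legal/models` bucket are assumptions, so check the Models Hub entry for the exact pipeline.

```python
# Hypothetical pipeline sketch: embedding choice and bucket are assumptions.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_sanctions_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame(
    [["The Company is not subject to any sanctions administered by OFAC."]]
).toDF("text")

result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```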
## Results
```bash
+-----------+
|result     |
+-----------+
|[Sanctions]|
|[Other]    |
|[Other]    |
|[Sanctions]|
+-----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_sanctions_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.2 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 1.00 0.95 0.97 19
Sanctions 0.92 1.00 0.96 11
accuracy - - 0.97 30
macro-avg 0.96 0.97 0.96 30
weighted-avg 0.97 0.97 0.97 30
```
---
layout: model
title: Word2Vec Embeddings in Uzbek (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, uz, open_source]
task: Embeddings
language: uz
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_uz_3.4.1_3.0_1647465690254.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_uz_3.4.1_3.0_1647465690254.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","uz") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Men Spark NLP ni yaxshi ko'raman"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","uz")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Men Spark NLP ni yaxshi ko'raman").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("uz.embed.w2v_cc_300d").predict("""Men Spark NLP ni yaxshi ko'raman""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|uz|
|Size:|481.9 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Stop Words Cleaner for Slovenian
author: John Snow Labs
name: stopwords_sl
date: 2020-07-14 19:03:00 +0800
task: Stop Words Removal
language: sl
edition: Spark NLP 2.5.4
spark_version: 2.4
tags: [stopwords, sl]
supported: true
annotator: StopWordsCleaner
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_sl_sl_2.5.4_2.4_1594742442155.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_sl_sl_2.5.4_2.4_1594742442155.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
stop_words = StopWordsCleaner.pretrained("stopwords_sl", "sl") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("John Snow je poleg tega, da je severni kralj, angleški zdravnik in vodilni v razvoju anestezije in zdravstvene higiene.")
```
```scala
...
val stopWords = StopWordsCleaner.pretrained("stopwords_sl", "sl")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords))
val data = Seq("John Snow je poleg tega, da je severni kralj, angleški zdravnik in vodilni v razvoju anestezije in zdravstvene higiene.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""John Snow je poleg tega, da je severni kralj, angleški zdravnik in vodilni v razvoju anestezije in zdravstvene higiene."""]
stopword_df = nlu.load('sl.stopwords').predict(text)
stopword_df[['cleanTokens']]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=3, result='John', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=5, end=8, result='Snow', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=23, end=23, result=',', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=31, end=37, result='severni', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=39, end=43, result='kralj', metadata={'sentence': '0'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_sl|
|Type:|stopwords|
|Compatibility:|Spark NLP 2.5.4+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|sl|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)
---
layout: model
title: Financial Finetuned FLAN-T5 Text Generation (FIQA dataset)
author: John Snow Labs
name: fingen_flant5_finetuned_fiqa
date: 2023-05-29
tags: [en, finance, generation, licensed, flant5, fiqa, tensorflow]
task: Text Generation
language: en
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: FinanceTextGenerator
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `fingen_flant5_finetuned_fiqa` model is a Text Generation model fine-tuned from FLAN-T5 on the FIQA dataset. FLAN-T5 is a state-of-the-art language model developed by Google AI that uses the T5 architecture for text generation tasks.
References:
```bibtex
@article{flant5_paper,
title={Scaling instruction-finetuned language models},
author={Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and Li, Eric and Wang, Xuezhi and Dehghani, Mostafa and Brahma, Siddhartha and others},
journal={arXiv preprint arXiv:2210.11416},
year={2022}
}
@article{t5_paper,
title={Exploring the limits of transfer learning with a unified text-to-text transformer},
author={Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J},
journal={The Journal of Machine Learning Research},
volume={21},
number={1},
pages={5485--5551},
year={2020},
publisher={JMLRORG}
}
```
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/fingen_flant5_finetuned_fiqa_en_1.0.0_3.0_1685363340017.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/fingen_flant5_finetuned_fiqa_en_1.0.0_3.0_1685363340017.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
flant5 = finance.TextGenerator.pretrained("fingen_flant5_finetuned_fiqa", "en", "finance/models")\
.setInputCols(["document"])\
.setOutputCol("generated")\
.setMaxNewTokens(256)\
.setStopAtEos(True)\
.setDoSample(True)\
.setTopK(3)
pipeline = nlp.Pipeline(stages=[document_assembler, flant5])
data = spark.createDataFrame([
[1, "How to have a small capital investment in US if I am out of the country?"]]).toDF('id', 'text')
results = pipeline.fit(data).transform(data)
results.select("generated.result").show(truncate=False)
```
## Results
```bash
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[I would suggest a local broker. They have diversified funds that are diversified and have the same fees as the US market. They also offer diversified portfolios that have the lowest risk.]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|fingen_flant5_finetuned_fiqa|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.6 GB|
## References
The dataset is available [here](https://huggingface.co/datasets/BeIR/fiqa)
---
layout: model
title: Extract Intent Type from Customer Service Chat Messages
author: John Snow Labs
name: finclf_customer_service_intent_type
date: 2023-02-03
tags: [en, licensed, intent, finance, customer, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: FinanceClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a Text Classification model that can help you classify a chat message from customer service according to intent type.
## Predicted Entities
`cancel_order`, `change_order`, `change_setup_shipping_address`, `check_cancellation_fee`, `check_payment_methods`, `check_refund_policy`, `complaint`, `contact_customer_service`, `contact_human_agent`, `create_edit_switch_account`, `delete_account`, `delivery_options`, `delivery_period`, `get_check_invoice`, `get_refund`, `newsletter_subscription`, `payment_issue`, `place_order`, `recover_password`, `registration_problems`, `review`, `track_order`, `track_refund`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_customer_service_intent_type_en_1.0.0_3.0_1675427852317.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_customer_service_intent_type_en_1.0.0_3.0_1675427852317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = nlp.UniversalSentenceEncoder.pretrained() \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
docClassifier = finance.ClassifierDLModel.pretrained("finclf_customer_service_intent_type", "en", "finance/models")\
.setInputCols("sentence_embeddings") \
.setOutputCol("class")
pipeline = nlp.Pipeline().setStages(
[
document_assembler,
embeddings,
docClassifier
]
)
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(empty_data)
light_model = nlp.LightPipeline(model)
result = light_model.annotate("""I have a problem with the deletion of my Premium account.""")
result["class"]
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_200000_cased_generator","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_200000_cased_generator","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ich liebe Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_embeddings_electra_base_gc4_64k_200000_cased_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|de|
|Size:|222.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/stefan-it/electra-base-gc4-64k-200000-cased-generator
- https://german-nlp-group.github.io/projects/gc4-corpus.html
- https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf
---
layout: model
title: Arabic Named Entity Recognition (from abdusahmbzuai)
author: John Snow Labs
name: bert_ner_arabert_ner
date: 2022-05-04
tags: [bert, ner, token_classification, ar, open_source]
task: Named Entity Recognition
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `arabert-ner` is an Arabic model originally trained by `abdusahmbzuai`.
## Predicted Entities
`ORG`, `LOC`, `PER`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_arabert_ner_ar_3.4.2_3.0_1651630356143.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_arabert_ner_ar_3.4.2_3.0_1651630356143.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_arabert_ner","ar") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_arabert_ner","ar")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("أنا أحب الشرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.ner.arabert_ner").predict("""أنا أحب الشرارة NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_arabert_ner|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|ar|
|Size:|505.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/abdusahmbzuai/arabert-ner
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_by_radhakri119 TFWav2Vec2ForCTC from radhakri119
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab_by_radhakri119
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_radhakri119` is an English model originally trained by radhakri119.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_radhakri119_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_radhakri119_en_4.2.0_3.0_1664101755978.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_radhakri119_en_4.2.0_3.0_1664101755978.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_timit_demo_colab_by_radhakri119", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_timit_demo_colab_by_radhakri119", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab_by_radhakri119|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|354.9 MB|
---
layout: model
title: Fast Neural Machine Translation Model from English to Central Bikol
author: John Snow Labs
name: opus_mt_en_bcl
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, bcl, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `en`
- target languages: `bcl`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_bcl_xx_2.7.0_2.4_1609170612440.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_bcl_xx_2.7.0_2.4_1609170612440.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_bcl", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_bcl", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.bcl').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_bcl|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering Base Cased model (from Moussab)
author: John Snow Labs
name: distilbert_qa_base_cased_led_squad_orkg_what_5e_05
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-orkg-what-5e-05` is an English model originally trained by `Moussab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_what_5e_05_en_4.3.0_3.0_1672766856335.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_what_5e_05_en_4.3.0_3.0_1672766856335.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_what_5e_05","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_what_5e_05","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_cased_led_squad_orkg_what_5e_05|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Moussab/distilbert-base-cased-distilled-squad-orkg-what-5e-05
---
layout: model
title: Javanese BertForMaskedLM Small Cased model (from w11wo)
author: John Snow Labs
name: bert_embeddings_javanese_small_imdb
date: 2022-12-02
tags: [jv, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: jv
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `javanese-bert-small-imdb` is a Javanese model originally trained by `w11wo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_javanese_small_imdb_jv_4.2.4_3.0_1670022513681.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_javanese_small_imdb_jv_4.2.4_3.0_1670022513681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_javanese_small_imdb","jv") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_javanese_small_imdb","jv")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_javanese_small_imdb|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|jv|
|Size:|410.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/w11wo/javanese-bert-small-imdb
- https://arxiv.org/abs/1810.04805
- https://github.com/sgugger
- https://w11wo.github.io/
---
layout: model
title: Persian Named Entity Recognition (from HooshvareLab)
author: John Snow Labs
name: bert_ner_bert_base_parsbert_peymaner_uncased
date: 2022-05-09
tags: [bert, ner, token_classification, fa, open_source]
task: Named Entity Recognition
language: fa
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-parsbert-peymaner-uncased` is a Persian model originally trained by `HooshvareLab`.
## Predicted Entities
`LOC`, `PER`, `TIM`, `MON`, `DAT`, `PCT`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_parsbert_peymaner_uncased_fa_3.4.2_3.0_1652099544405.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_parsbert_peymaner_uncased_fa_3.4.2_3.0_1652099544405.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_parsbert_peymaner_uncased","fa") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["من عاشق جرقه nlp هستم"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_parsbert_peymaner_uncased","fa")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("من عاشق جرقه nlp هستم").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_bert_base_parsbert_peymaner_uncased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|fa|
|Size:|607.0 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/HooshvareLab/bert-base-parsbert-peymaner-uncased
- https://arxiv.org/abs/2005.12515
- http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/
- https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb
- https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb
- https://tensorflow.org/tfrc
- https://hooshvare.com
- https://www.linkedin.com/in/m3hrdadfi/
- https://twitter.com/m3hrdadfi
- https://github.com/m3hrdadfi
- https://www.linkedin.com/in/mohammad-gharachorloo/
- https://twitter.com/MGharachorloo
- https://github.com/baarsaam
- https://www.linkedin.com/in/marziehphi/
- https://twitter.com/marziehphi
- https://github.com/marziehphi
- https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/
- https://twitter.com/mmanthouri
- https://github.com/mmanthouri
- https://hooshvare.com/
- https://www.linkedin.com/company/hooshvare
- https://twitter.com/hooshvare
- https://github.com/hooshvare
- https://www.instagram.com/hooshvare/
- https://www.linkedin.com/in/sara-tabrizi-64548b79/
- https://www.behance.net/saratabrizi
- https://www.instagram.com/sara_b_tabrizi/
---
layout: model
title: Explain Document Pipeline for Norwegian (Bokmal)
author: John Snow Labs
name: explain_document_sm
date: 2021-03-22
tags: [open_source, norwegian_bokmal, explain_document_sm, pipeline, "no"]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: "no"
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The explain_document_sm is a pretrained pipeline that processes text with a simple sequence of basic annotators.
It performs most of the common text processing tasks (sentence detection, tokenization, lemmatization, part-of-speech tagging, and named entity recognition) on your dataframe.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_no_3.0.0_3.0_1616427435939.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_sm_no_3.0.0_3.0_1616427435939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('explain_document_sm', lang = 'no')
annotations = pipeline.fullAnnotate("Hei fra John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_sm", lang = "no")
val result = pipeline.fullAnnotate("Hei fra John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hei fra John Snow Labs! "]
result_df = nlu.load('no.explain').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | lemma | pos | embeddings | ner | entities |
|---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:--------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
| 0 | ['Hei fra John Snow Labs! '] | ['Hei fra John Snow Labs!'] | ['Hei', 'fra', 'John', 'Snow', 'Labs!'] | ['Hei', 'fra', 'John', 'Snow', 'Labs!'] | ['PROPN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[-0.394499987363815,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_document_sm|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|no|
---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_ff9000
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-ff9000` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_ff9000_en_4.3.0_3.0_1675123598242.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_ff9000_en_4.3.0_3.0_1675123598242.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_tiny_ff9000","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_tiny_ff9000","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_ff9000|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|110.1 MB|
## References
- https://huggingface.co/google/t5-efficient-tiny-ff9000
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English RobertaForMaskedLM Base Cased model
author: John Snow Labs
name: roberta_embeddings_distil_base
date: 2022-12-12
tags: [en, open_source, roberta_embeddings, robertaformaskedlm]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base` is an English model originally trained by HuggingFace.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distil_base_en_4.2.4_3.0_1670858593481.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distil_base_en_4.2.4_3.0_1670858593481.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_distil_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCols(Array("text"))
.setOutputCols(Array("document"))
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_distil_base","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_distil_base|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|308.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/distilroberta-base
- https://arxiv.org/abs/1910.01108
- https://aclanthology.org/2021.acl-long.330.pdf
- https://dl.acm.org/doi/pdf/10.1145/3442188.3445922
- https://skylion007.github.io/OpenWebTextCorpus/
- https://mlco2.github.io/impact#compute
- https://arxiv.org/abs/1910.09700
---
layout: model
title: English RobertaForQuestionAnswering Large Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_large_few_shot_k_1024_finetuned_squad_seed_4
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-few-shot-k-1024-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_few_shot_k_1024_finetuned_squad_seed_4_en_4.3.0_3.0_1674221411536.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_few_shot_k_1024_finetuned_squad_seed_4_en_4.3.0_3.0_1674221411536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_few_shot_k_1024_finetuned_squad_seed_4","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_few_shot_k_1024_finetuned_squad_seed_4","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_large_few_shot_k_1024_finetuned_squad_seed_4|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anas-awadalla/roberta-large-few-shot-k-1024-finetuned-squad-seed-4
---
layout: model
title: English BertForQuestionAnswering model (from MrAnderson)
author: John Snow Labs
name: bert_qa_bert_base_1024_full_trivia_copied_embeddings
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-1024-full-trivia-copied-embeddings` is an English model originally trained by `MrAnderson`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_1024_full_trivia_copied_embeddings_en_4.0.0_3.0_1654179607765.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_1024_full_trivia_copied_embeddings_en_4.0.0_3.0_1654179607765.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_1024_full_trivia_copied_embeddings","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_1024_full_trivia_copied_embeddings","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.trivia.bert.base_1024d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_1024_full_trivia_copied_embeddings|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|409.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/MrAnderson/bert-base-1024-full-trivia-copied-embeddings
---
layout: model
title: Legal Transfer Clause Binary Classifier
author: John Snow Labs
name: legclf_transfer_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `transfer` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques shown in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
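Paragraph splitting by multiline can be sketched in plain Python; this is an illustrative helper (the function name and the `max_chars` budget are assumptions, not part of the tutorial), useful for keeping each piece within the model's context limit:

```python
import re

def split_paragraphs(text: str, max_chars: int = 2000):
    """Split a long document on blank lines; further split any
    paragraph that still exceeds the character budget."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    for p in paragraphs:
        # Cut oversized paragraphs at the last space before the budget.
        while len(p) > max_chars:
            cut = p.rfind(" ", 0, max_chars)
            cut = cut if cut > 0 else max_chars
            chunks.append(p[:cut].strip())
            p = p[cut:].strip()
        chunks.append(p)
    return chunks

doc = "Clause 1. Transfer of rights...\n\nClause 2. Governing law..."
print(split_paragraphs(doc))
```

Each returned chunk can then be sent through the classifier pipeline as a separate row.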
Take into consideration that the embeddings used by this model allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `transfer`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_transfer_clause_en_1.0.0_3.2_1660124097088.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_transfer_clause_en_1.0.0_3.2_1660124097088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
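This card ships without a usage snippet; the sketch below follows the pipeline pattern used by other Legal NLP clause classifiers (DocumentAssembler, sentence embeddings feeding the model's `sentence_embeddings` input, then the classifier). It requires a licensed Spark NLP for Legal environment, and the embedding model name `sent_bert_base_cased` is an illustrative assumption, not confirmed by this card:

```python
# Hypothetical pipeline sketch: the embeddings model name is an assumption.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence-level embeddings feeding the classifier's sentence_embeddings input.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_transfer_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])
df = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```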
## Results
```bash
+----------+
|    result|
+----------+
|[transfer]|
|   [other]|
|   [other]|
|[transfer]|
+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_transfer_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.91 0.94 0.93 66
transfer 0.90 0.86 0.88 43
accuracy - - 0.91 109
macro-avg 0.91 0.90 0.90 109
weighted-avg 0.91 0.91 0.91 109
```
---
layout: model
title: Fast Neural Machine Translation Model from Tumbuka to English
author: John Snow Labs
name: opus_mt_tum_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, tum, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `tum`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_tum_en_xx_2.7.0_2.4_1609168643029.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_tum_en_xx_2.7.0_2.4_1609168643029.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_tum_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_tum_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.tum.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_tum_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English asr_wav2vec2_bilal_20epoch TFWav2Vec2ForCTC from Roshana
author: John Snow Labs
name: pipeline_asr_wav2vec2_bilal_20epoch
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_bilal_20epoch` is an English model originally trained by Roshana.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use `pipeline_asr_wav2vec2_bilal_20epoch_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_bilal_20epoch_en_4.2.0_3.0_1664119706366.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_bilal_20epoch_en_4.2.0_3.0_1664119706366.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_bilal_20epoch', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_bilal_20epoch", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_bilal_20epoch|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Fast Neural Machine Translation Model from English to Luba-Lulua
author: John Snow Labs
name: opus_mt_en_lua
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, lua, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `lua`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_lua_xx_2.7.0_2.4_1609164365414.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_lua_xx_2.7.0_2.4_1609164365414.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_lua", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_lua", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.lua').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_lua|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Detect Clinical Entities (jsl_ner_wip_clinical)
author: John Snow Labs
name: jsl_ner_wip_clinical
date: 2021-01-18
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 2.7.0
spark_version: 2.4
tags: [ner, en, clinical, licensed]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for clinical terminology. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN.
## Predicted Entities
`Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Respiration`, `Hyperlipidemia`, `Birth_Entity`, `Age`, `Labour_Delivery`, `Family_History_Header`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Symptom`, `Treatment`, `Substance`, `Route`, `Drug_Ingredient`, `Blood_Pressure`, `Diet`, `I-Age`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `I-Diet`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Drug_BrandName`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Sexually_Active_or_Sexual_Orientation`, `Frequency`, `Time`, `Weight`, `Vaccine`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Overweight`, `Hypertension`, `HDL`, `Total_Cholesterol`, `Smoking`, `Date`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_clinical_en_2.6.5_2.4_1609505628141.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_clinical_en_2.6.5_2.4_1609505628141.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter at the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings_clinical = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models') \
.setInputCols(['sentence', 'token']) \
.setOutputCol('embeddings')
clinical_ner = NerDLModel.pretrained("jsl_ner_wip_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."]], ["text"]))
```
```scala
...
val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val ner = NerDLModel.pretrained("jsl_ner_wip_clinical", "en", "clinical/models")
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))
val data = Seq("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
{:.h2_title}
## Results
The output is a dataframe with one sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, the entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe, or add the ``Finisher`` annotator to the end of your pipeline.
```bash
+-----------------------------------------+----------------------------+
|chunk |ner_label |
+-----------------------------------------+----------------------------+
|21-day-old |Age |
|Caucasian |Race_Ethnicity |
|male |Gender |
|for 2 days |Duration |
|congestion |Symptom |
|mom |Gender |
|yellow |Modifier |
|discharge |Symptom |
|nares |External_body_part_or_region|
|she |Gender |
|mild |Modifier |
|problems with his breathing while feeding|Symptom |
|perioral cyanosis |Symptom |
|retractions |Symptom |
|One day ago |RelativeDate |
|mom |Gender |
|Tylenol |Drug_BrandName |
|Baby |Age |
|decreased p.o. intake |Symptom |
|His |Gender |
+-----------------------------------------+----------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|jsl_ner_wip_clinical|
|Type:|ner|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Case sensitive:|false|
{:.h2_title}
## Data Source
Trained on data gathered and manually annotated by John Snow Labs.
https://www.johnsnowlabs.com/data/
{:.h2_title}
## Benchmarking
```bash
entity tp fp fn total precision recall f1
VS_Finding 235.0 46.0 43.0 278.0 0.8363 0.8453 0.8408
Direction 3972.0 465.0 458.0 4430.0 0.8952 0.8966 0.8959
Respiration 82.0 4.0 4.0 86.0 0.9535 0.9535 0.9535
Cerebrovascular_D... 93.0 20.0 24.0 117.0 0.823 0.7949 0.8087
Family_History_He... 88.0 6.0 3.0 91.0 0.9362 0.967 0.9514
Heart_Disease 447.0 82.0 119.0 566.0 0.845 0.7898 0.8164
RelativeTime 158.0 80.0 59.0 217.0 0.6639 0.7281 0.6945
Strength 624.0 58.0 53.0 677.0 0.915 0.9217 0.9183
Smoking 121.0 11.0 4.0 125.0 0.9167 0.968 0.9416
Medical_Device 3716.0 491.0 466.0 4182.0 0.8833 0.8886 0.8859
Pulse 136.0 22.0 14.0 150.0 0.8608 0.9067 0.8831
Psychological_Con... 135.0 9.0 29.0 164.0 0.9375 0.8232 0.8766
Overweight 2.0 1.0 0.0 2.0 0.6667 1.0 0.8
Triglycerides 3.0 0.0 2.0 5.0 1.0 0.6 0.75
Obesity 42.0 5.0 6.0 48.0 0.8936 0.875 0.8842
Admission_Discharge 318.0 24.0 11.0 329.0 0.9298 0.9666 0.9478
HDL 3.0 0.0 0.0 3.0 1.0 1.0 1.0
Diabetes 110.0 14.0 8.0 118.0 0.8871 0.9322 0.9091
Section_Header 3740.0 148.0 157.0 3897.0 0.9619 0.9597 0.9608
Age 627.0 75.0 48.0 675.0 0.8932 0.9289 0.9107
O2_Saturation 34.0 14.0 17.0 51.0 0.7083 0.6667 0.6869
Kidney_Disease 96.0 12.0 34.0 130.0 0.8889 0.7385 0.8067
Test 2504.0 545.0 498.0 3002.0 0.8213 0.8341 0.8276
Communicable_Disease 21.0 10.0 6.0 27.0 0.6774 0.7778 0.7241
Hypertension 162.0 5.0 10.0 172.0 0.9701 0.9419 0.9558
External_body_par... 2626.0 356.0 413.0 3039.0 0.8806 0.8641 0.8723
Oxygen_Therapy 81.0 15.0 14.0 95.0 0.8438 0.8526 0.8482
Modifier 2341.0 404.0 539.0 2880.0 0.8528 0.8128 0.8324
Test_Result 1007.0 214.0 255.0 1262.0 0.8247 0.7979 0.8111
BMI 9.0 1.0 0.0 9.0 0.9 1.0 0.9474
Labour_Delivery 57.0 23.0 33.0 90.0 0.7125 0.6333 0.6706
Employment 271.0 59.0 55.0 326.0 0.8212 0.8313 0.8262
Fetus_NewBorn 66.0 33.0 51.0 117.0 0.6667 0.5641 0.6111
Clinical_Dept 923.0 110.0 83.0 1006.0 0.8935 0.9175 0.9053
Time 29.0 13.0 16.0 45.0 0.6905 0.6444 0.6667
Procedure 3185.0 462.0 501.0 3686.0 0.8733 0.8641 0.8687
Diet 36.0 20.0 45.0 81.0 0.6429 0.4444 0.5255
Oncological 459.0 61.0 55.0 514.0 0.8827 0.893 0.8878
LDL 3.0 0.0 3.0 6.0 1.0 0.5 0.6667
Symptom 7104.0 1302.0 1200.0 8304.0 0.8451 0.8555 0.8503
Temperature 116.0 6.0 8.0 124.0 0.9508 0.9355 0.9431
Vital_Signs_Header 215.0 29.0 24.0 239.0 0.8811 0.8996 0.8903
Relationship_Status 49.0 2.0 1.0 50.0 0.9608 0.98 0.9703
Total_Cholesterol 11.0 4.0 5.0 16.0 0.7333 0.6875 0.7097
Blood_Pressure 158.0 18.0 22.0 180.0 0.8977 0.8778 0.8876
Injury_or_Poisoning 579.0 130.0 127.0 706.0 0.8166 0.8201 0.8184
Drug_Ingredient 1716.0 153.0 132.0 1848.0 0.9181 0.9286 0.9233
Treatment 136.0 36.0 60.0 196.0 0.7907 0.6939 0.7391
Pregnancy 123.0 36.0 51.0 174.0 0.7736 0.7069 0.7387
Vaccine 13.0 2.0 6.0 19.0 0.8667 0.6842 0.7647
Disease_Syndrome_... 2981.0 559.0 446.0 3427.0 0.8421 0.8699 0.8557
Height 30.0 10.0 15.0 45.0 0.75 0.6667 0.7059
Frequency 595.0 99.0 138.0 733.0 0.8573 0.8117 0.8339
Route 858.0 76.0 89.0 947.0 0.9186 0.906 0.9123
Duration 351.0 99.0 108.0 459.0 0.78 0.7647 0.7723
Death_Entity 43.0 14.0 5.0 48.0 0.7544 0.8958 0.819
Internal_organ_or... 6477.0 972.0 991.0 7468.0 0.8695 0.8673 0.8684
Alcohol 80.0 18.0 13.0 93.0 0.8163 0.8602 0.8377
Substance_Quantity 6.0 7.0 4.0 10.0 0.4615 0.6 0.5217
Date 498.0 38.0 19.0 517.0 0.9291 0.9632 0.9459
Hyperlipidemia 47.0 3.0 3.0 50.0 0.94 0.94 0.94
Social_History_He... 99.0 7.0 7.0 106.0 0.934 0.934 0.934
Race_Ethnicity 116.0 0.0 0.0 116.0 1.0 1.0 1.0
Imaging_Technique 40.0 18.0 47.0 87.0 0.6897 0.4598 0.5517
Drug_BrandName 859.0 62.0 61.0 920.0 0.9327 0.9337 0.9332
RelativeDate 566.0 124.0 143.0 709.0 0.8203 0.7983 0.8091
Gender 6096.0 80.0 101.0 6197.0 0.987 0.9837 0.9854
Dosage 244.0 31.0 57.0 301.0 0.8873 0.8106 0.8472
Form 234.0 32.0 55.0 289.0 0.8797 0.8097 0.8432
Medical_History_H... 114.0 9.0 10.0 124.0 0.9268 0.9194 0.9231
Birth_Entity 4.0 2.0 3.0 7.0 0.6667 0.5714 0.6154
Substance 59.0 8.0 11.0 70.0 0.8806 0.8429 0.8613
Sexually_Active_o... 5.0 3.0 4.0 9.0 0.625 0.5556 0.5882
Weight 90.0 10.0 21.0 111.0 0.9 0.8108 0.8531
macro - - - - - - 0.8148
micro - - - - - - 0.8788
```
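Each row in the table above derives precision, recall, and F1 from the raw true-positive, false-positive, and false-negative counts in the standard way. A minimal pure-Python check against the `VS_Finding` row:

```python
# Derive precision, recall, and F1 from tp/fp/fn counts, as in the table above.
def prf1(tp, fp, fn):
    precision = tp / (tp + fp)          # fraction of predicted chunks that are correct
    recall = tp / (tp + fn)             # fraction of gold chunks that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# VS_Finding row: tp=235, fp=46, fn=43
p, r, f = prf1(235.0, 46.0, 43.0)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.8363 0.8453 0.8408
```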
---
layout: model
title: Legal Economic Analysis Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_economic_analysis_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, economic_analysis, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the `legclf_economic_analysis_bert` model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the `Economic_Analysis` class or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`Economic_Analysis`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_economic_analysis_bert_en_1.0.0_3.0_1678111814559.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_economic_analysis_bert_en_1.0.0_3.0_1678111814559.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
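A typical pipeline for this classifier follows the same pattern as other Legal NLP document classifiers: assemble the document, produce Bert sentence embeddings, and feed them to the classifier. The snippet below is a minimal sketch, not the card's original example; the sentence-embeddings model name (`sent_bert_base_cased`) is an assumption based on comparable `legclf_*_bert` cards, and running it requires a licensed Legal NLP installation with an active `spark` session.

```python
# Minimal sketch (assumes a licensed Legal NLP installation; the
# sentence-embeddings model name below is an assumption, not from this card).
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_economic_analysis_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```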
## Results
```bash
+-------------------+
|result             |
+-------------------+
|[Economic_Analysis]|
|[Other]            |
|[Other]            |
|[Economic_Analysis]|
+-------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_economic_analysis_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.2 MB|
## References
Training dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Economic_Analysis 0.88 0.84 0.86 116
Other 0.84 0.88 0.86 111
accuracy - - 0.86 227
macro-avg 0.86 0.86 0.86 227
weighted-avg 0.86 0.86 0.86 227
```
---
layout: model
title: Arabic Bert Embeddings (MARBERT model v2)
author: John Snow Labs
name: bert_embeddings_MARBERTv2
date: 2022-04-11
tags: [bert, embeddings, ar, open_source]
task: Embeddings
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `MARBERTv2` is an Arabic model originally trained by `UBC-NLP`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_MARBERTv2_ar_3.4.2_3.0_1649678231280.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_MARBERTv2_ar_3.4.2_3.0_1649678231280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_MARBERTv2","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_MARBERTv2","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("أنا أحب شرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.embed.MARBERTv2").predict("""أنا أحب شرارة NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_MARBERTv2|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|609.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/UBC-NLP/MARBERTv2
- https://aclanthology.org/2021.acl-long.551.pdf
- https://github.com/UBC-NLP/marbert
- https://doi.org/10.14288/SOCKEYE
- https://www.tensorflow.org/tfrc
---
layout: model
title: English image_classifier_vit_ice_cream ViTForImageClassification from juanfiguera
author: John Snow Labs
name: image_classifier_vit_ice_cream
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_ice_cream` is an English model originally trained by juanfiguera.
## Predicted Entities
`chocolate ice cream`, `vanilla ice cream`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ice_cream_en_4.1.0_3.0_1660170383355.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ice_cream_en_4.1.0_3.0_1660170383355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_ice_cream", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_ice_cream", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_ice_cream|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Detect PHI for Deidentification purposes (Spanish, reduced entities, augmented data)
author: John Snow Labs
name: ner_deid_generic_augmented
date: 2022-02-15
tags: [deid, es, licensed]
task: De-identification
language: es
edition: Healthcare NLP 3.3.4
spark_version: 2.4
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 8 entities (1 more than the `ner_deid_generic` model).
This NER model is trained on a combination of custom datasets, the Spanish CoNLL 2002 corpus, and the MeddoProf dataset, using several data augmentation mechanisms, and has been further augmented with the MEDDOCAN Spanish deidentification corpus (which `ner_deid_generic` does not include). It is a generalized version of `ner_deid_subentity_augmented`.
## Predicted Entities
`CONTACT`, `NAME`, `DATE`, `ID`, `LOCATION`, `PROFESSION`, `AGE`, `SEX`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_es_3.3.4_2.4_1644925864218.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_es_3.3.4_2.4_1644925864218.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")\
.setInputCols(["sentence","token"])\
.setOutputCol("word_embeddings")
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "es", "clinical/models")\
.setInputCols(["sentence","token","word_embeddings"])\
.setOutputCol("ner")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner])
text = ['''
Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.
''']
df = spark.createDataFrame([text]).toDF("text")
results = nlpPipeline.fit(df).transform(df)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("word_embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "es", "clinical/models")
.setInputCols(Array("sentence","token","word_embeddings"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner))
val text = "Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos."
val df = Seq(text).toDF("text")
val results = pipeline.fit(df).transform(df)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.med_ner.deid.generic_augmented").predict("""
Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.
""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_base_dm256","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_base_dm256","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_base_dm256|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|158.7 MB|
## References
- https://huggingface.co/google/t5-efficient-base-dm256
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Cyberbullying Classifier
author: John Snow Labs
name: classifierdl_use_cyberbullying
class: ClassifierDLModel
language: en
nav_key: models
repository: public/models
date: 03/07/2020
task: Text Classification
edition: Spark NLP 2.5.3
spark_version: 2.4
tags: [classifier]
supported: true
annotator: ClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Identify racist, sexist, or neutral tweets.
{:.h2_title}
## Predicted Entities
``neutral``, ``racism``, ``sexism``.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN_CYBERBULLYING/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN_CYBERBULLYING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_cyberbullying_en_2.5.3_2.4_1593783319298.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_cyberbullying_en_2.5.3_2.4_1593783319298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
use = UniversalSentenceEncoder.pretrained(lang="en") \
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
document_classifier = ClassifierDLModel.pretrained('classifierdl_use_cyberbullying', 'en') \
.setInputCols(["document", "sentence_embeddings"]) \
.setOutputCol("class")
nlpPipeline = Pipeline(stages=[documentAssembler, use, document_classifier])
light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate('@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked')
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val use = UniversalSentenceEncoder.pretrained(lang="en")
.setInputCols(Array("document"))
.setOutputCol("sentence_embeddings")
val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_cyberbullying", "en")
.setInputCols(Array("document", "sentence_embeddings"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier))
val data = Seq("@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked"""]
cyberbull_df = nlu.load('classify.cyberbullying.use').predict(text, output_level='document')
cyberbull_df[["document", "cyberbullying"]]
```
{:.h2_title}
## Results
```bash
+--------------------------------------------------------------------------------------------------------+------------+
|document |class |
+--------------------------------------------------------------------------------------------------------+------------+
|@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked. | racism |
+--------------------------------------------------------------------------------------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
| Model Name | classifierdl_use_cyberbullying |
| Model Class | ClassifierDLModel |
| Spark Compatibility | 2.5.3 |
| Spark NLP Compatibility | 2.4 |
| License | open source |
| Edition | public |
| Input Labels | [document, sentence_embeddings] |
| Output Labels | [class] |
| Language | en |
| Upstream Dependencies | tfhub_use |
{:.h2_title}
## Data Source
This model is trained on a cyberbullying detection dataset: https://raw.githubusercontent.com/dhavalpotdar/cyberbullying-detection/master/data/data/data.csv
{:.h2_title}
## Benchmarking
```bash
precision recall f1-score support
none 0.69 1.00 0.81 3245
racism 0.00 0.00 0.00 568
sexism 0.00 0.00 0.00 922
accuracy 0.69 4735
macro avg 0.23 0.33 0.27 4735
weighted avg 0.47 0.69 0.56 4735
```
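The macro and weighted averages in the table above follow the standard definitions: the macro average is the unweighted mean of the per-class scores, while the weighted average weights each class by its support. A small pure-Python check using the precision column:

```python
# Recompute the averaged precision scores from the per-class benchmark rows.
# Rows: (class, precision, support), taken from the table above.
rows = [("none", 0.69, 3245), ("racism", 0.00, 568), ("sexism", 0.00, 922)]

total_support = sum(support for _, _, support in rows)

# Macro average: unweighted mean over classes.
macro_precision = sum(p for _, p, _ in rows) / len(rows)

# Weighted average: mean weighted by class support.
weighted_precision = sum(p * support for _, p, support in rows) / total_support

print(round(macro_precision, 2))     # 0.23, matching the "macro avg" row
print(round(weighted_precision, 2))  # 0.47, matching the "weighted avg" row
```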
---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_bert_small_pretrained_finetuned_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-small-pretrained-finetuned-squad` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_pretrained_finetuned_squad_en_4.0.0_3.0_1654184786135.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_pretrained_finetuned_squad_en_4.0.0_3.0_1654184786135.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_small_pretrained_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_small_pretrained_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.small_finetuned.by_anas-awadalla").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_small_pretrained_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|107.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-small-pretrained-finetuned-squad
---
layout: model
title: Detect Assertion Status (assertion_jsl)
author: John Snow Labs
name: assertion_jsl
date: 2021-07-24
tags: [licensed, clinical, assertion, en]
task: Assertion Status
language: en
nav_key: models
edition: Healthcare NLP 3.1.2
spark_version: 2.4
supported: true
annotator: AssertionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The deep neural network architecture for assertion status detection in Spark NLP is based on a BiLSTM framework, and is a modified version of the architecture proposed by Fancellu et al. (Fancellu, Lopez, and Webber 2016). Its goal is to classify the assertions made on given medical concepts as being present, absent, or possible in the patient; conditionally present in the patient under certain circumstances; hypothetically present in the patient at some future point; or mentioned in the patient report but associated with someone else (Uzuner et al. 2011).
## Predicted Entities
`Present`, `Absent`, `Possible`, `Planned`, `Someoneelse`, `Past`, `Family`, `None`, `Hypotetical`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_jsl_en_3.1.2_2.4_1627139823450.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_jsl_en_3.1.2_2.4_1627139823450.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
clinical_assertion = AssertionDLModel.pretrained("assertion_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion])
text="""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."""
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
```
```scala
...
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val clinical_assertion = AssertionDLModel.pretrained("assertion_jsl", "en", "clinical/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings"))
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion))
val data = Seq("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.assert.jsl").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
## Results
The output is a dataframe with one sentence per row and an `assertion` column containing all of the assertion labels in the sentence, together with assertion character indices and other metadata. To get only the entity chunks and assertion labels, without the metadata, select `ner_chunk.result` and `assertion.result` from your output dataframe.
```bash
+-----------------------------------------+-----+---+----------------------------+-------+---------+
|chunk |begin|end|ner_label |sent_id|assertion|
+-----------------------------------------+-----+---+----------------------------+-------+---------+
|21-day-old |17 |26 |Age |0 |Family |
|Caucasian |28 |36 |Race_Ethnicity |0 |Family |
|male |38 |41 |Gender |0 |Family |
|for 2 days |48 |57 |Duration |0 |Family |
|congestion |62 |71 |Symptom |0 |Present |
|mom |75 |77 |Gender |0 |Family |
|yellow |99 |104|Modifier |0 |Family |
|discharge |106 |114|Symptom |0 |Family |
|nares |135 |139|External_body_part_or_region|0 |Family |
|she |147 |149|Gender |0 |Family |
|mild |168 |171|Modifier |0 |Family |
|problems with his breathing while feeding|173 |213|Symptom |0 |Present |
|perioral cyanosis |237 |253|Symptom |0 |Absent |
|retractions |258 |268|Symptom |0 |Absent |
|One day ago |272 |282|RelativeDate |1 |Family |
|mom |285 |287|Gender |1 |Family |
|Tylenol |345 |351|Drug_BrandName |1 |Family |
|Baby |354 |357|Age |2 |Family |
|decreased p.o. intake |377 |397|Symptom |2 |Family |
|His |400 |402|Gender |3 |Family |
+-----------------------------------------+-----+---+----------------------------+-------+---------+
```
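Once `ner_chunk.result` and `assertion.result` have been collected from the output dataframe, they are parallel arrays that can be zipped into (chunk, label) pairs. A minimal pure-Python sketch, using illustrative sample values rather than actual model output:

```python
def pair_assertions(chunks, assertions):
    """Zip parallel chunk/assertion result arrays into (chunk, label) pairs."""
    if len(chunks) != len(assertions):
        raise ValueError("chunk and assertion arrays must be parallel")
    return list(zip(chunks, assertions))

# Hypothetical values of ner_chunk.result / assertion.result for one sentence:
chunks = ["congestion", "perioral cyanosis", "retractions"]
labels = ["Present", "Absent", "Absent"]
print(pair_assertions(chunks, labels))
```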
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|assertion_jsl|
|Compatibility:|Healthcare NLP 3.1.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, chunk, embeddings]|
|Output Labels:|[assertion]|
|Language:|en|
## Data Source
Trained on 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text with ‘embeddings_clinical’. https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
## Benchmarking
```bash
label prec rec f1
Absent 0.970 0.943 0.956
Someoneelse 0.868 0.775 0.819
Planned 0.721 0.754 0.737
Possible 0.852 0.884 0.868
Past 0.811 0.823 0.817
Present 0.833 0.866 0.849
Family 0.872 0.921 0.896
None 0.609 0.359 0.452
Hypothetical 0.722 0.810 0.763
Macro-average 0.888 0.872 0.880
Micro-average 0.908 0.908 0.908
```
---
layout: model
title: English RobertaForQuestionAnswering (from deepset)
author: John Snow Labs
name: roberta_qa_roberta_large_squad2_hp
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-squad2-hp` is an English model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_squad2_hp_en_4.0.0_3.0_1655737792807.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_squad2_hp_en_4.0.0_3.0_1655737792807.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_squad2_hp","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_large_squad2_hp","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.large.by_deepset").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
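After `transform`, the predicted span lands in the `answer` output column, so collecting `answer.result` yields one list of answer strings per row. A small post-processing sketch over hypothetical collected rows (the dict shape stands in for collected Spark rows):

```python
def first_answers(rows):
    """Pull the first answer string per row, or None when no span was found."""
    return [row["result"][0] if row["result"] else None for row in rows]

# Illustrative rows as might be collected from result.select("answer.result"):
rows = [{"result": ["Clara"]}, {"result": []}]
print(first_answers(rows))
```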
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_large_squad2_hp|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/deepset/roberta-large-squad2-hp
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Gam)
author: John Snow Labs
name: distilbert_qa_base_uncased_cuad_finetuned
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-cuad-distilbert` is an English model originally trained by `Gam`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_cuad_finetuned_en_4.3.0_3.0_1672767889486.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_cuad_finetuned_en_4.3.0_3.0_1672767889486.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_cuad_finetuned","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_cuad_finetuned","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_cuad_finetuned|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Gam/distilbert-base-uncased-finetuned-cuad-distilbert
---
layout: model
title: English image_classifier_vit_lawn_weeds ViTForImageClassification from LorenzoDeMattei
author: John Snow Labs
name: image_classifier_vit_lawn_weeds
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_lawn_weeds` is an English model originally trained by LorenzoDeMattei.
## Predicted Entities
`clover`, `dichondra`, `grass`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_lawn_weeds_en_4.1.0_3.0_1660171067931.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_lawn_weeds_en_4.1.0_3.0_1660171067931.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_lawn_weeds", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
# imageDF: a dataframe of images loaded with Spark's image data source, e.g.
# imageDF = spark.read.format("image").option("dropInvalid", True).load("path/to/images")
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_lawn_weeds", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_lawn_weeds|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Pipeline to Detect radiology concepts (ner_radiology_wip_clinical)
author: John Snow Labs
name: ner_radiology_wip_clinical_pipeline
date: 2023-03-14
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_radiology_wip_clinical](https://nlp.johnsnowlabs.com/2021/04/01/ner_radiology_wip_clinical_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_radiology_wip_clinical_pipeline_en_4.3.0_3.2_1678801944623.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_radiology_wip_clinical_pipeline_en_4.3.0_3.2_1678801944623.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_radiology_wip_clinical_pipeline", "en", "clinical/models")
text = '''Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_radiology_wip_clinical_pipeline", "en", "clinical/models")
val text = "Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.radiology.clinical_wip.pipeline").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""")
```
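`fullAnnotate` returns one result per input text, mapping output column names to lists of annotations; flattening those into (chunk, label) pairs is a one-liner. The dict-shaped annotations below are an illustrative stand-in for the annotation objects the pipeline actually returns:

```python
def chunks_with_labels(annotated, column="ner_chunks"):
    """Collect (chunk text, entity label) pairs from one fullAnnotate result."""
    return [(a["result"], a["metadata"]["entity"]) for a in annotated[column]]

# Hypothetical sample mirroring the first rows of the results table above:
sample = {
    "ner_chunks": [
        {"result": "Bilateral", "metadata": {"entity": "Direction"}},
        {"result": "breast", "metadata": {"entity": "BodyPart"}},
    ]
}
print(chunks_with_labels(sample))
```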
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:----------------------|--------:|------:|:--------------------------|-------------:|
| 0 | Bilateral | 0 | 8 | Direction | 0.9828 |
| 1 | breast | 10 | 15 | BodyPart | 0.8169 |
| 2 | ultrasound | 17 | 26 | ImagingTest | 0.6216 |
| 3 | ovoid mass | 78 | 87 | ImagingFindings | 0.6917 |
| 4 | 0.5 x 0.5 x 0.4 | 113 | 127 | Measurements | 0.91524 |
| 5 | cm | 129 | 130 | Units | 0.9987 |
| 6 | anteromedial aspect | 163 | 181 | Direction | 0.8241 |
| 7 | left | 190 | 193 | Direction | 0.4667 |
| 8 | shoulder | 195 | 202 | BodyPart | 0.6349 |
| 9 | mass | 210 | 213 | ImagingFindings | 0.9611 |
| 10 | isoechoic echotexture | 228 | 248 | ImagingFindings | 0.6851 |
| 11 | muscle | 266 | 271 | BodyPart | 0.7805 |
| 12 | internal color flow | 294 | 312 | ImagingFindings | 0.5153 |
| 13 | benign fibrous tissue | 334 | 354 | ImagingFindings | 0.394867 |
| 14 | lipoma | 361 | 366 | Disease_Syndrome_Disorder | 0.9142 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_radiology_wip_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_squadv2_recipe_tokenwise_token_and_step_losses_3_epochs
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squadv2-recipe-roberta-tokenwise-token-and-step-losses-3-epochs` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_squadv2_recipe_tokenwise_token_and_step_losses_3_epochs_en_4.3.0_3.0_1674224122519.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_squadv2_recipe_tokenwise_token_and_step_losses_3_epochs_en_4.3.0_3.0_1674224122519.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_squadv2_recipe_tokenwise_token_and_step_losses_3_epochs","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_squadv2_recipe_tokenwise_token_and_step_losses_3_epochs","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_squadv2_recipe_tokenwise_token_and_step_losses_3_epochs|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|467.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AnonymousSub/squadv2-recipe-roberta-tokenwise-token-and-step-losses-3-epochs
---
layout: model
title: English BertForQuestionAnswering Tiny Cased model (from M-FAC)
author: John Snow Labs
name: bert_qa_tiny_finetuned_squadv2
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-finetuned-squadv2` is an English model originally trained by `M-FAC`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_tiny_finetuned_squadv2_en_4.0.0_3.0_1657188687594.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_tiny_finetuned_squadv2_en_4.0.0_3.0_1657188687594.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tiny_finetuned_squadv2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tiny_finetuned_squadv2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_tiny_finetuned_squadv2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|16.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/M-FAC/bert-tiny-finetuned-squadv2
- https://arxiv.org/pdf/2107.03356.pdf
- https://github.com/IST-DASLab/M-FAC
---
layout: model
title: Detect Units and Measurements in text
author: John Snow Labs
name: ner_measurements_clinical
date: 2021-04-01
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Extract units and other measurements from reports, prescriptions, and other medical texts using a pretrained NER model.
## Predicted Entities
`Units`, `Measurements`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_measurements_clinical_en_3.0.0_3.0_1617260795877.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_measurements_clinical_en_3.0.0_3.0_1617260795877.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_measurements_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_measurements_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))
val data = Seq("EXAMPLE_TEXT").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.measurements").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_measurements_clinical|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
---
layout: model
title: Legal Law Area Prediction Classifier (Italian)
author: John Snow Labs
name: legclf_law_area_prediction_italian
date: 2023-03-29
tags: [it, licensed, classification, legal, tensorflow]
task: Text Classification
language: it
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a multiclass classification model that identifies the law area (civil_law, penal_law, public_law, social_law) of Italian court cases.
## Predicted Entities
`civil_law`, `penal_law`, `public_law`, `social_law`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_law_area_prediction_italian_it_1.0.0_3.0_1680095983817.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_law_area_prediction_italian_it_1.0.0_3.0_1680095983817.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_multi_cased", "xx")\
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")
docClassifier = legal.ClassifierDLModel.pretrained("legclf_law_area_prediction_italian", "it", "legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
embeddings,
docClassifier
])
df = spark.createDataFrame([["Per questi motivi, il Tribunale federale pronuncia: 1. Nella misura in cui è ammissibile, il ricorso è respinto. 2. Le spese giudiziarie di fr. 1'000.-- sono poste a carico dei ricorrenti. 3. Comunicazione al patrocinatore dei ricorrenti, al Consiglio di Stato, al Gran Consiglio, al Tribunale amministrativo del Cantone Ticino e all'Ufficio federale dello sviluppo territoriale."]]).toDF("text")
model = nlpPipeline.fit(df)
result = model.transform(df)
result.select("text", "category.result").show(truncate=100)
```
## Results
```bash
+----------------------------------------------------------------------------------------------------+------------+
| text| result|
+----------------------------------------------------------------------------------------------------+------------+
|Per questi motivi, il Tribunale federale pronuncia: 1. Nella misura in cui è ammissibile, il rico...|[public_law]|
+----------------------------------------------------------------------------------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_law_area_prediction_italian|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|it|
|Size:|22.3 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/rcds/legal_criticality_prediction)
## Benchmarking
```bash
label precision recall f1-score support
civil_law 0.86 0.86 0.86 58
penal_law 0.85 0.82 0.83 55
public_law 0.79 0.79 0.79 52
social_law 0.93 0.96 0.94 68
accuracy - - 0.86 233
macro-avg 0.86 0.86 0.86 233
weighted-avg 0.86 0.86 0.86 233
```
---
layout: model
title: Finnish asr_wav2vec2_large_uralic_voxpopuli_v2_finnish TFWav2Vec2ForCTC from Finnish-NLP
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_uralic_voxpopuli_v2_finnish
date: 2022-09-24
tags: [wav2vec2, fi, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_uralic_voxpopuli_v2_finnish` is a Finnish model originally trained by Finnish-NLP.
NOTE: This pipeline only works on a CPU. If you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_uralic_voxpopuli_v2_finnish_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_uralic_voxpopuli_v2_finnish_fi_4.2.0_3.0_1664038039563.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_uralic_voxpopuli_v2_finnish_fi_4.2.0_3.0_1664038039563.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_uralic_voxpopuli_v2_finnish', lang = 'fi')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_uralic_voxpopuli_v2_finnish", lang = "fi")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_uralic_voxpopuli_v2_finnish|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|fi|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English image_classifier_vit_base_patch16_224_in21k_classify_4scence ViTForImageClassification from HaoHu
author: John Snow Labs
name: image_classifier_vit_base_patch16_224_in21k_classify_4scence
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224_in21k_classify_4scence` is an English model originally trained by HaoHu.
## Predicted Entities
`City_road`, `fog`, `rain`, `snow`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_classify_4scence_en_4.1.0_3.0_1660171116662.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_classify_4scence_en_4.1.0_3.0_1660171116662.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_base_patch16_224_in21k_classify_4scence", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_base_patch16_224_in21k_classify_4scence", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_base_patch16_224_in21k_classify_4scence|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Detect Living Species
author: John Snow Labs
name: bert_token_classifier_ner_living_species
date: 2022-06-26
tags: [en, ner, clinical, licensed, bertfortokenclassification]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
annotator: MedicalBertForTokenClassifier
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Extracts living species mentions from clinical texts, a task critical to scientific disciplines such as medicine, biology, ecology/biodiversity, nutrition, and agriculture. This model was trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP.
It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus, which is composed of clinical case reports from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others.
**NOTE:**
- The text files were translated from Spanish with a neural machine translation system.
- The annotations were translated with the same neural machine translation system.
- The translated annotations were transferred to the translated text files using an annotation transfer technology.
## Predicted Entities
`HUMAN`, `SPECIES`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_en_3.5.3_3.0_1656273939035.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_en_3.5.3_3.0_1656273939035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_living_species", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("ner")\
.setCaseSensitive(True)\
.setMaxSentenceLength(512)
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
ner_model,
ner_converter
])
data = spark.createDataFrame([["""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_living_species", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
ner_model,
ner_converter))
val data = Seq("""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.living_species.token_bert").predict("""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL.""")
```
## Results
```bash
+-----------------------+-------+
|ner_chunk |label |
+-----------------------+-------+
|woman |HUMAN |
|bacterial |SPECIES|
|Fusarium spp |SPECIES|
|patient |HUMAN |
|species |SPECIES|
|Fusarium solani complex|SPECIES|
|antifungals |SPECIES|
+-----------------------+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_living_species|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
[https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/)
## Benchmarking
```bash
label precision recall f1-score support
B-HUMAN 0.83 0.96 0.89 2950
B-SPECIES 0.70 0.93 0.80 3129
I-HUMAN 0.73 0.39 0.51 145
I-SPECIES 0.67 0.81 0.74 1166
micro-avg 0.74 0.91 0.82 7390
macro-avg 0.73 0.77 0.73 7390
weighted-avg 0.75 0.91 0.82 7390
```
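As a sanity check on the table above, the weighted-average F1 can be recomputed from the per-label rows: it is the support-weighted mean of the per-label F1 scores. A minimal sketch, using the scores and supports copied from the benchmarking table:

```python
# Recompute the weighted-average F1 from the per-label benchmarking rows.
# Values are copied from the table above: label -> (f1-score, support).
rows = {
    "B-HUMAN":   (0.89, 2950),
    "B-SPECIES": (0.80, 3129),
    "I-HUMAN":   (0.51, 145),
    "I-SPECIES": (0.74, 1166),
}

total_support = sum(support for _, support in rows.values())
weighted_f1 = sum(f1 * support for f1, support in rows.values()) / total_support

print(total_support)           # 7390, matching the support column
print(round(weighted_f1, 2))   # 0.82, matching the weighted-avg row
```

The macro average, by contrast, is the unweighted mean of the per-label scores, which is why it is pulled down by the small `I-HUMAN` class.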
---
layout: model
title: Pipeline to Resolve ICD-10-CM Codes
author: John Snow Labs
name: icd10cm_resolver_pipeline
date: 2023-04-28
tags: [en, licensed, clinical, resolver, chunk_mapping, pipeline, icd10cm]
task: Entity Resolution
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline maps entities to their corresponding ICD-10-CM codes. Feed in your text and it returns the matching ICD-10-CM codes.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_resolver_pipeline_en_4.3.2_3.0_1682726202207.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_resolver_pipeline_en_4.3.2_3.0_1682726202207.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
resolver_pipeline = PretrainedPipeline("icd10cm_resolver_pipeline", "en", "clinical/models")
text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage"""
result = resolver_pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val resolver_pipeline = new PretrainedPipeline("icd10cm_resolver_pipeline", "en", "clinical/models")
val result = resolver_pipeline.fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.icd10cm_resolver.pipeline").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""")
```
## Results
```bash
+-----------------------------+---------+------------+
|chunk |ner_chunk|icd10cm_code|
+-----------------------------+---------+------------+
|gestational diabetes mellitus|PROBLEM |O24.919 |
|anisakiasis |PROBLEM |B81.0 |
|fetal and neonatal hemorrhage|PROBLEM |P545 |
+-----------------------------+---------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|icd10cm_resolver_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|3.5 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- ChunkMapperModel
- ChunkMapperModel
- ChunkMapperFilterer
- Chunk2Doc
- BertSentenceEmbeddings
- SentenceEntityResolverModel
- ResolverMerger
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from horsbug98)
author: John Snow Labs
name: xlm_roberta_qa_Part_2_XLM_Model_E1
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Part_2_XLM_Model_E1` is an English model originally trained by `horsbug98`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_Part_2_XLM_Model_E1_en_4.0.0_3.0_1655983522974.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_Part_2_XLM_Model_E1_en_4.0.0_3.0_1655983522974.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_Part_2_XLM_Model_E1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_Part_2_XLM_Model_E1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.tydiqa.xlm_roberta.v2.by_horsbug98").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_Part_2_XLM_Model_E1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|814.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/horsbug98/Part_2_XLM_Model_E1
---
layout: model
title: Word Embeddings for Urdu (urduvec_140M_300d)
author: John Snow Labs
name: urduvec_140M_300d
date: 2020-12-01
task: Embeddings
language: ur
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [embeddings, ur, open_source]
supported: true
annotator: WordEmbeddingsModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained using the Word2Vec approach on a corpus of 140 million tokens, has a vocabulary of 100k unique tokens, and gives 300-dimensional vector outputs per token. The output vectors map words into a meaningful space where the distance between vectors reflects the semantic similarity of the words.
These embeddings can be used in multiple tasks like semantic word similarity, named entity recognition, sentiment analysis, and classification.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/urduvec_140M_300d_ur_2.7.0_2.4_1606810614734.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/urduvec_140M_300d_ur_2.7.0_2.4_1606810614734.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['مجھے سپارک این ایل پی پسند ہے۔']], ["text"]))
```
```scala
val embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("مجھے سپارک این ایل پی پسند ہے۔").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["مجھے سپارک این ایل پی پسند ہے۔"]
urduvec_df = nlu.load('ur.embed.urdu_vec_140M_300d').predict(text, output_level="token")
urduvec_df
```
{:.h2_title}
## Results
The model gives a 300-dimensional Word2Vec feature vector output per token.
```bash
|Embeddings vector | Tokens
|----------------------------------------------------|---------
| [0.15994004905223846, -0.2213257998228073, 0.0... | مجھے
| [-0.16085924208164215, -0.12259697169065475, -... | سپارک
| [-0.07977486401796341, -0.528775691986084, 0.3... | این
| [-0.24136857688426971, -0.15272589027881622, 0... | ایل
| [-0.23666366934776306, -0.16016320884227753, 0... | پی
| [0.07911433279514313, 0.05598200485110283, 0.0... | پسند
```
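The per-token vectors above can be compared with cosine similarity for tasks like semantic word similarity. A minimal sketch of the computation, using made-up 3-dimensional stand-in vectors (the real `urduvec_140M_300d` vectors are 300-dimensional, and the values shown above are truncated):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, for illustration only (not real model outputs).
v_word_a = [0.9, 0.1, 0.3]
v_word_b = [0.8, 0.2, 0.35]
v_word_c = [0.1, 0.9, 0.0]

# Semantically close words should score higher than unrelated ones.
print(cosine_similarity(v_word_a, v_word_b) > cosine_similarity(v_word_a, v_word_c))  # True
```

In a Spark NLP pipeline, the same comparison would be applied to the vectors in the `embeddings` output column.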
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|urduvec_140M_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.7.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[word_embeddings]|
|Language:|ur|
|Case sensitive:|false|
|Dimension:|300|
## Data Source
The model is imported from [http://www.lrec-conf.org/proceedings/lrec2018/pdf/148.pdf](http://www.lrec-conf.org/proceedings/lrec2018/pdf/148.pdf)
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_4_h_768
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-4_H-768` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_768_zh_4.2.4_3.0_1670021693429.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_768_zh_4.2.4_3.0_1670021693429.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_768","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_768","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_4_h_768|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|170.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-4_H-768
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://github.com/dbiir/UER-py/
- https://cloud.tencent.com/
---
layout: model
title: Oncology Pipeline for Biomarkers
author: John Snow Labs
name: oncology_biomarker_pipeline
date: 2022-11-04
tags: [licensed, en, oncology]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP for Healthcare 4.2.2
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline includes Named Entity Recognition, Assertion Status, and Relation Extraction models to extract information from oncology texts, with a focus on entities related to biomarkers.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/oncology_biomarker_pipeline_en_4.2.2_3.0_1667581643291.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/oncology_biomarker_pipeline_en_4.2.2_3.0_1667581643291.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("oncology_biomarker_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.")[0]
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("oncology_biomarker_pipeline", "en", "clinical/models")
val result = pipeline.fullAnnotate("""Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.""")(0)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.oncology_biomarker.pipeline").predict("""Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.""")
```
## Results
```bash
relation entity1 entity1_begin entity1_end chunk1 entity2 entity2_begin entity2_end chunk2 confidence
was_acquired_by ORG 0 13 Whatsapp, Inc. ORG 31 34 Meta 0.9527305
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finre_acquisitions_subsidiaries_md|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|405.7 MB|
## References
In-house annotations on SEC 10K filings and Wikidata
## Benchmarking
```bash
label Recall Precision F1 Support
is_subsidiary_of 0.583 0.618 0.600 36
other 0.975 0.948 0.961 243
was_acquired 0.836 0.895 0.864 61
was_acquired_by 0.767 0.780 0.773 60
Avg. 0.790 0.810 0.800 406
Weighted-Avg. 0.887 0.885 0.886 406
```
---
layout: model
title: French XlmRoBertaForQuestionAnswering (from saattrupdan)
author: John Snow Labs
name: xlm_roberta_qa_xlmr_base_texas_squad_fr_fr_saattrupdan
date: 2022-06-24
tags: [open_source, question_answering, xlmroberta, fr]
task: Question Answering
language: fr
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlmr-base-texas-squad-fr` is a French model originally trained by `saattrupdan`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_base_texas_squad_fr_fr_saattrupdan_fr_4.0.0_3.0_1656066193139.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_base_texas_squad_fr_fr_saattrupdan_fr_4.0.0_3.0_1656066193139.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlmr_base_texas_squad_fr_fr_saattrupdan","fr") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlmr_base_texas_squad_fr_fr_saattrupdan","fr")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("fr.answer_question.squad.xlmr_roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlmr_base_texas_squad_fr_fr_saattrupdan|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|fr|
|Size:|873.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/saattrupdan/xlmr-base-texas-squad-fr
---
layout: model
title: Sentence Entity Resolver for Snomed (sbertresolve_snomed_conditions)
author: John Snow Labs
name: sbertresolve_snomed_conditions
date: 2021-08-28
tags: [snomed, licensed, en, clinical]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.1.3
spark_version: 2.4
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities (domain: Conditions) to Snomed codes using `sbert_jsl_medium_uncased` Sentence Bert Embeddings.
## Predicted Entities
Snomed codes and their normalized definitions, resolved with `sbert_jsl_medium_uncased` embeddings.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_snomed_conditions_en_3.1.3_2.4_1630180858399.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_snomed_conditions_en_3.1.3_2.4_1630180858399.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
sbert_embedder = BertSentenceEmbeddings.pretrained('sbert_jsl_medium_uncased', 'en','clinical/models')\
.setInputCols(["ner_chunk"])\
.setOutputCol("sbert_embeddings")
snomed_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_snomed_conditions", "en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("snomed_code")\
.setDistanceFunction("EUCLIDEAN")
snomed_pipelineModel = PipelineModel(
stages = [
documentAssembler,
sbert_embedder,
snomed_resolver
])
snomed_lp = LightPipeline(snomed_pipelineModel)
result = snomed_lp.fullAnnotate("schizophrenia")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")
.setInputCols("ner_chunk")
.setOutputCol("sbert_embeddings")
val snomed_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_snomed_conditions", "en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("snomed_code")
.setDistanceFunction("EUCLIDEAN")
val snomed_pipelineModel = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, snomed_resolver)).fit(Seq("").toDF("text"))
val snomed_lp = new LightPipeline(snomed_pipelineModel)
val result = snomed_lp.fullAnnotate("schizophrenia")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.snomed_conditions").predict("""Put your text here.""")
```
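The resolver ranks candidate SNOMED codes by Euclidean distance between sentence embeddings (`setDistanceFunction("EUCLIDEAN")`), which is why the top match in the results below has distance 0.0000. A toy sketch of that selection step, using made-up 3-dimensional embeddings and code labels purely for illustration:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical embeddings: the query chunk and a few candidate code entries.
query = [0.2, 0.1, 0.9]
candidates = {
    "58214004": [0.2, 0.1, 0.9],   # schizophrenia
    "83746006": [0.3, 0.2, 0.8],   # chronic schizophrenia
    "274952002": [0.9, 0.4, 0.1],  # borderline schizophrenia
}

# Pick the code whose embedding is closest to the query.
best_code = min(candidates, key=lambda c: euclidean(query, candidates[c]))
print(best_code)  # 58214004 (distance 0.0)
```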
## Results
```bash
| | chunks | code | resolutions | all_codes | all_distances |
|---:|:--------------|:---------|:-------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------|:-----------------------------------------------------|
| 0 | schizophrenia | 58214004 | [schizophrenia, chronic schizophrenia, borderline schizophrenia, schizophrenia, catatonic, subchronic schizophrenia, ...]| [58214004, 83746006, 274952002, 191542003, 191529003, 16990005, ...] | [0.0000, 0.0774, 0.0838, 0.0927, 0.0970, 0.0970, ...] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbertresolve_snomed_conditions|
|Compatibility:|Healthcare NLP 3.1.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk, sbert_embeddings]|
|Output Labels:|[snomed_code]|
|Language:|en|
|Case sensitive:|false|
---
layout: model
title: Arabic Part of Speech Tagger (Modern Standard Arabic (MSA), Modern Standard Arabic-MSA POS)
author: John Snow Labs
name: bert_pos_bert_base_arabic_camelbert_msa_pos_msa
date: 2022-04-26
tags: [bert, pos, part_of_speech, ar, open_source]
task: Part of Speech Tagging
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-msa-pos-msa` is an Arabic model originally trained by `CAMeL-Lab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_msa_pos_msa_ar_3.4.2_3.0_1650993589088.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_msa_pos_msa_ar_3.4.2_3.0_1650993589088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_msa_pos_msa","ar") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_msa_pos_msa","ar")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("أنا أحب الشرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.pos.arabic_camelbert_msa_pos_msa").predict("""أنا أحب الشرارة NLP""")
```

{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_pos_bert_base_arabic_camelbert_msa_pos_msa|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|ar|
|Size:|407.1 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa-pos-msa
- https://dl.acm.org/doi/pdf/10.5555/1621804.1621808
- https://arxiv.org/abs/2103.06678
- https://github.com/CAMeL-Lab/CAMeLBERT
- https://github.com/CAMeL-Lab/camel_tools
---
layout: model
title: Persian BertForMaskedLM Base Cased model (from HooshvareLab)
author: John Snow Labs
name: bert_embeddings_fa_zwnj_base
date: 2022-12-02
tags: [fa, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: fa
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-fa-zwnj-base` is a Persian model originally trained by `HooshvareLab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_fa_zwnj_base_fa_4.2.4_3.0_1670019440290.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_fa_zwnj_base_fa_4.2.4_3.0_1670019440290.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_fa_zwnj_base","fa") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_fa_zwnj_base","fa")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_fa_zwnj_base|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|fa|
|Size:|444.3 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/HooshvareLab/bert-fa-zwnj-base
- https://arxiv.org/abs/2005.12515
- https://github.com/hooshvare/parsbert/issues
---
layout: model
title: XLM-RoBERTa Base NER Pipeline
author: ahmedlone127
name: xlm_roberta_base_token_classifier_ontonotes_pipeline
date: 2022-06-14
tags: [open_source, ner, token_classifier, xlm_roberta, ontonotes, xlm, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: false
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [xlm_roberta_base_token_classifier_ontonotes](https://nlp.johnsnowlabs.com/2021/10/03/xlm_roberta_base_token_classifier_ontonotes_en.html) model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/xlm_roberta_base_token_classifier_ontonotes_pipeline_en_4.0.0_3.0_1655216428417.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/xlm_roberta_base_token_classifier_ontonotes_pipeline_en_4.0.0_3.0_1655216428417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("xlm_roberta_base_token_classifier_ontonotes_pipeline", lang = "en")
pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.")
```
```scala
val pipeline = new PretrainedPipeline("xlm_roberta_base_token_classifier_ontonotes_pipeline", lang = "en")
pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|John |PERSON |
|John Snow Labs|ORG |
|November 2020 |DATE |
+--------------+---------+
```
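`annotate` returns a plain dictionary of output columns. A sketch of pairing recognized chunks with their labels, using a hard-coded dictionary shaped like the result shown above (the exact keys depend on the pipeline's output columns):

```python
# Hypothetical annotate() output for the example sentence above.
annotations = {
    "entities": ["John", "John Snow Labs", "November 2020"],
    "ner_label": ["PERSON", "ORG", "DATE"],
}

# Zip the parallel lists into (chunk, label) pairs.
pairs = list(zip(annotations["entities"], annotations["ner_label"]))
for chunk, label in pairs:
    print(f"{chunk} -> {label}")
```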
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_base_token_classifier_ontonotes_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Community|
|Language:|en|
|Size:|858.4 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- XlmRoBertaForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury TFWav2Vec2ForCTC from Satyamatury
author: John Snow Labs
name: asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury` is an English model originally trained by Satyamatury.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury_en_4.2.0_3.0_1664112233589.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury_en_4.2.0_3.0_1664112233589.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
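The snippets above assume an existing `audioDf` whose `audio_content` column holds the raw audio as an array of floats. One common way to produce such an array is to decode a mono 16-bit PCM WAV file with the standard library and normalize samples to [-1, 1]; a minimal sketch (the file name and the DataFrame construction at the end are illustrative placeholders):

```python
import struct
import wave

def wav_to_floats(path: str):
    """Decode a mono 16-bit PCM WAV file into a list of floats in [-1, 1]."""
    with wave.open(path, "rb") as wav:
        n = wav.getnframes()
        raw = wav.readframes(n)
    samples = struct.unpack("<%dh" % n, raw)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("sample.wav")
# audioDf = spark.createDataFrame([[floats]], ["audio_content"])
```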
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18)
author: John Snow Labs
name: distilbert_qa_base_uncased_becas_5
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becas-5` is an English model originally trained by `Evelyn18`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_5_en_4.3.0_3.0_1672767558441.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_5_en_4.3.0_3.0_1672767558441.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_5","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_5","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_becas_5|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Evelyn18/distilbert-base-uncased-becas-5
---
layout: model
title: Translate Berber to English Pipeline
author: John Snow Labs
name: translate_ber_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, ber, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `ber`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ber_en_xx_2.7.0_2.4_1609687862998.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ber_en_xx_2.7.0_2.4_1609687862998.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_ber_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_ber_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.ber.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_ber_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: PICO Classifier
author: John Snow Labs
name: classifierdl_pico_biobert
date: 2020-11-12
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 2.6.2
spark_version: 2.4
tags: [classifier, en, licensed, clinical]
supported: true
annotator: ClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Classifies medical text according to the PICO framework.
{:.h2_title}
## Predicted Entities
``CONCLUSIONS``, ``DESIGN_SETTING``, ``INTERVENTION``, ``PARTICIPANTS``, ``FINDINGS``, ``MEASUREMENTS``, ``AIMS``.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_PICO/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLINICAL_CLASSIFICATION.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierdl_pico_biobert_en_2.6.2_2.4_1601901791781.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierdl_pico_biobert_en_2.6.2_2.4_1601901791781.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, BertEmbeddings (biobert_pubmed_base_cased), SentenceEmbeddings, ClassifierDLModel.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained('biobert_pubmed_base_cased')\
.setInputCols(["document", 'token'])\
.setOutputCol("word_embeddings")
sentence_embeddings = SentenceEmbeddings() \
.setInputCols(["document", "word_embeddings"]) \
.setOutputCol("sentence_embeddings") \
.setPoolingStrategy("AVERAGE")
classifier = ClassifierDLModel.pretrained('classifierdl_pico_biobert', 'en', 'clinical/models')\
.setInputCols(['document', 'token', 'sentence_embeddings']).setOutputCol('class')
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, sentence_embeddings, classifier])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate(["""A total of 10 adult daily smokers who reported at least one stressful event and coping episode and provided post-quit data.""", """When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced."""])
```
```scala
...
val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
.setInputCols(Array("document", "token"))
.setOutputCol("word_embeddings")
val sentence_embeddings = new SentenceEmbeddings()
.setInputCols(Array("document", "word_embeddings"))
.setOutputCol("sentence_embeddings")
.setPoolingStrategy("AVERAGE")
val classifier = ClassifierDLModel.pretrained("classifierdl_pico_biobert", "en", "clinical/models")
.setInputCols(Array("document", "token", "sentence_embeddings")).setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, embeddings, sentence_embeddings, classifier))
val data = Seq("A total of 10 adult daily smokers who reported at least one stressful event and coping episode and provided post-quit data.", "When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.pico").predict("""A total of 10 adult daily smokers who reported at least one stressful event and coping episode and provided post-quit data.""")
```
{:.h2_title}
## Results
A dictionary containing class labels for each sentence.
```bash
| sentences | class |
|------------------------------------------------------+--------------+
| A total of 10 adult daily smokers who reported at... | PARTICIPANTS |
| When carbamazepine is withdrawn from the combinat... | CONCLUSIONS |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|classifierdl_pico_biobert|
|Type:|ClassifierDLModel|
|Compatibility:|Healthcare NLP 2.6.2 +|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|[en]|
|Case sensitive:|True|
{:.h2_title}
## Data Source
Trained on a custom dataset derived from the PICO classification dataset, using `biobert_pubmed_base_cased` embeddings.
{:.h2_title}
## Benchmarking
```bash
| | labels | precision | recall | f1-score | support |
|---:|---------------:|----------:|---------:|---------:|--------:|
| 0 | AIMS | 0.9197 | 0.9121 | 0.9159 | 3845 |
| 1 | CONCLUSIONS | 0.8426 | 0.8571 | 0.8498 | 4241 |
| 2 | DESIGN_SETTING | 0.7703 | 0.8351 | 0.8014 | 5191 |
| 3 | FINDINGS | 0.9214 | 0.8964 | 0.9088 | 9500 |
| 4 | INTERVENTION | 0.7529 | 0.6758 | 0.7123 | 2597 |
| 5 | MEASUREMENTS | 0.8409 | 0.7734 | 0.8058 | 3500 |
| 6 | PARTICIPANTS | 0.7521 | 0.8548 | 0.8002 | 2396 |
|   |       accuracy |           |          |   0.8476 |   31270 |
| | macro avg | 0.8286 | 0.8292 | 0.8277 | 31270 |
| | weighted avg | 0.8495 | 0.8476 | 0.8476 | 31270 |
```
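The macro-average row can be reproduced directly from the per-class scores, as a quick sanity check on the benchmarking table above:

```python
# Per-class (precision, recall, f1) values from the benchmarking table.
scores = {
    "AIMS":           (0.9197, 0.9121, 0.9159),
    "CONCLUSIONS":    (0.8426, 0.8571, 0.8498),
    "DESIGN_SETTING": (0.7703, 0.8351, 0.8014),
    "FINDINGS":       (0.9214, 0.8964, 0.9088),
    "INTERVENTION":   (0.7529, 0.6758, 0.7123),
    "MEASUREMENTS":   (0.8409, 0.7734, 0.8058),
    "PARTICIPANTS":   (0.7521, 0.8548, 0.8002),
}

def macro_avg(idx):
    """Unweighted mean of one metric (0=precision, 1=recall, 2=f1)."""
    return round(sum(v[idx] for v in scores.values()) / len(scores), 4)

print(macro_avg(0), macro_avg(1), macro_avg(2))  # 0.8286 0.8292 0.8277
```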
---
layout: model
title: Match Chunks in Texts
author: John Snow Labs
name: match_chunks
date: 2022-06-15
tags: [en, open_source]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The pipeline matches noun chunks using the `Chunker` annotator with the part-of-speech regex pattern `<DT>?<JJ>*<NN>+` (an optional determiner, followed by any number of adjectives and one or more nouns).
## Predicted Entities
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/match_chunks_en_4.0.0_3.0_1655322760895.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/match_chunks_en_4.0.0_3.0_1655322760895.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline_local = PretrainedPipeline('match_chunks')
result = pipeline_local.annotate("David visited the restaurant yesterday with his family. He also visited and the day before, but at that time he was alone. David again visited today with his colleagues. He and his friends really liked the food and hoped to visit again tomorrow.")
result['chunk']
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
SparkNLP.version()
val testData = spark.createDataFrame(Seq( (1, "David visited the restaurant yesterday with his family. He also visited and the day before, but at that time he was alone. David again visited today with his colleagues. He and his friends really liked the food and hoped to visit again tomorrow."))).toDF("id", "text")
val pipeline = PretrainedPipeline("match_chunks", lang="en")
val annotation = pipeline.transform(testData)
annotation.show()
```
{:.nlu-block}
```python
import nlu
nlu.load("en.match.chunks").predict("""David visited the restaurant yesterday with his family. He also visited and the day before, but at that time he was alone. David again visited today with his colleagues. He and his friends really liked the food and hoped to visit again tomorrow.""")
```
## Results
```bash
['the restaurant yesterday',
'family',
'the day',
'that time',
'today',
'the food',
'tomorrow']
```
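Conceptually, the chunker matches its grammar against the sequence of part-of-speech tags rather than the raw words. A toy re-implementation of a typical noun-chunk grammar, `<DT>?<JJ>*<NN>+` (optional determiner, any adjectives, one or more nouns), over an angle-bracketed tag string — an illustration of the idea only, not Spark NLP's internals:

```python
import re

# <DT>?<JJ>*<NN>+ : optional determiner, any adjectives, one or more nouns.
CHUNK_PATTERN = re.compile(r"(<DT>)?(<JJ>)*(<NN>)+")

def is_chunk(tags):
    """Check whether a POS-tag sequence forms a noun chunk under the grammar."""
    encoded = "".join(f"<{t}>" for t in tags)
    return CHUNK_PATTERN.fullmatch(encoded) is not None

print(is_chunk(["DT", "JJ", "NN"]))  # True  (e.g. "the big dog")
print(is_chunk(["NN"]))              # True  (e.g. "today")
print(is_chunk(["VB", "DT"]))        # False
```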
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|match_chunks|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|4.2 MB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- PerceptronModel
- Chunker
---
layout: model
title: Word Embeddings for Dutch (dutch_cc_300d)
author: John Snow Labs
name: dutch_cc_300d
date: 2021-10-04
tags: [nl, embeddings, open_source]
task: Embeddings
language: nl
edition: Spark NLP 3.3.0
spark_version: 3.0
supported: true
annotator: WordEmbeddingsModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained on the Common Crawl and Wikipedia datasets for the Dutch language using fastText. It is trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives.
The model gives 300 dimensional vector outputs per token. The output vectors map words into a meaningful space where the distance between the vectors is related to semantic similarity of words.
These embeddings can be used in multiple tasks like semantic word similarity, named entity recognition, sentiment analysis, and classification.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/dutch_cc_300d_nl_3.3.0_3.0_1633366113070.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/dutch_cc_300d_nl_3.3.0_3.0_1633366113070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
embeddings = WordEmbeddingsModel.pretrained("dutch_cc_300d", "nl") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
data = spark.createDataFrame([["De Bijlmerramp is de benaming voor de vliegramp"]]).toDF("text")
pipeline_model = nlp_pipeline.fit(data)
result = pipeline_model.transform(data)
```
```scala
val embeddings = WordEmbeddingsModel.pretrained("dutch_cc_300d", "nl")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("De Bijlmerramp is de benaming voor de vliegramp").toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| token | embedding |
|:-------------|:-------------------------------------------------------------------------------|
| De | ['0.0249', '-0.0115', '-0.0748', '-0.0823', '0.0866', '-0.0219', '0.00' ...] |
| Bijlmerramp | ['0.0204', '0.0079', '0.0224', '0.0352', '-0.0409', '0.0053', '0.0175', ...] |
| is | ['-1.0E-4', '0.1419', '0.053', '-0.0921', '0.07', '0.004', '-0.1683', ...] |
| de | ['0.0309', '0.0411', '-0.0077', '-0.0756', '0.0741', '-0.0402', '0.025' ...] |
| benaming | ['0.0197', '0.0167', '-0.0051', '0.0198', '0.034', '-0.0086', '-0.009', ...] |
| voor | ['0.0642', '-0.0171', '-0.0118', '0.0042', '0.0058', '0.0018', '0.0039' ...] |
| de | ['0.0309', '0.0411', '-0.0077', '-0.0756', '0.0741', '-0.0402', '0.025' ...] |
| vliegramp | ['0.083', '0.025', '0.0029', '0.0064', '-0.0698', '0.0344', '-0.0305', ...] |
```
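Since the model maps identical tokens to identical vectors (note the two `de` rows above), their similarity is maximal, while the distance between different words' vectors reflects their semantic relatedness. A small sketch of cosine similarity over such vectors (the example values are illustrative truncations, not the model's actual 300-dimensional output):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

de_1 = [0.0309, 0.0411, -0.0077, -0.0756]  # truncated, illustrative
de_2 = [0.0309, 0.0411, -0.0077, -0.0756]  # identical token -> identical vector
voor = [0.0642, -0.0171, -0.0118, 0.0042]

print(round(cosine(de_1, de_2), 4))  # 1.0
print(round(cosine(de_1, voor), 4))
```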
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|dutch_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[word_embeddings]|
|Language:|nl|
|Case sensitive:|false|
|Dimension:|300|
## Data Source
This model is imported from https://fasttext.cc/docs/en/crawl-vectors.html
---
layout: model
title: Detect Assertion Status (DL Large)
author: John Snow Labs
name: assertion_dl_large_en
date: 2020-05-21
task: Assertion Status
language: en
nav_key: models
edition: Healthcare NLP 2.5.0
spark_version: 2.4
tags: [ner, en, clinical, licensed]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Deep learning named entity recognition model for assertions. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs".
## Predicted Entities
``hypothetical``, ``present``, ``absent``, ``possible``, ``conditional``, ``associated_with_someone_else``.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_large_en_2.5.0_2.4_1590022282256.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_large_en_2.5.0_2.4_1590022282256.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter, AssertionDLModel.
{% include programmingLanguageSelectScalaPython.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
clinical_assertion = AssertionDLModel.pretrained("assertion_dl_large", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion])
model = nlpPipeline.fit(spark.createDataFrame([["Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer."]]).toDF("text"))
light_model = LightPipeline(model)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val clinical_assertion = AssertionDLModel.pretrained("assertion_dl_large", "en", "clinical/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings"))
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, nerConverter, clinical_assertion))
val data = Seq("""Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
The output is a dataframe with a sentence per row and an ``"assertion"`` column containing all of the assertion labels in the sentence. The assertion column also contains assertion character indices, and other metadata. To get only the entity chunks and assertion labels, without the metadata, select ``"ner_chunk.result"`` and ``"assertion.result"`` from your output dataframe.
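Since the two result arrays are aligned one-to-one, the chunk/label pairing can be sketched in plain Python (the lists below are mock stand-ins for the `ner_chunk.result` and `assertion.result` columns, not Spark NLP API calls):

```python
# Mock values standing in for the ner_chunk.result and assertion.result
# columns of the output dataframe; the two arrays are aligned one-to-one.
ner_chunks = ["severe fever", "sore throat", "stomach pain"]
assertions = ["present", "present", "absent"]

# Zipping the aligned columns yields (entity chunk, assertion label) pairs
# without the character indices and other metadata.
pairs = list(zip(ner_chunks, assertions))
print(pairs)
```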
```bash
chunks entities assertion
0 severe fever PROBLEM present
1 sore throat PROBLEM present
2 stomach pain PROBLEM absent
3 an epidural TREATMENT present
4 PCA TREATMENT present
5 pain control PROBLEM present
6 short of breath PROBLEM conditional
7 CT TEST present
8 lung tumor PROBLEM present
9 Alzheimer PROBLEM associated_with_someone_else
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|assertion_dl_large|
|Type:|ner|
|Compatibility:|Spark NLP 2.5.0+|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence, ner_chunk, embeddings]|
|Output Labels:|[assertion]|
|Language:|[en]|
|Case sensitive:|false|
{:.h2_title}
## Data Source
Trained on the 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text, using 'embeddings_clinical' embeddings.
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
{:.h2_title}
## Benchmarking
```bash
prec rec f1
absent 0.97 0.91 0.94
associated_with_someone_else 0.93 0.87 0.90
conditional 0.70 0.33 0.44
hypothetical 0.91 0.82 0.86
possible 0.81 0.59 0.68
present 0.93 0.98 0.95
micro avg 0.93 0.93 0.93
macro avg 0.87 0.75 0.80
```
---
layout: model
title: Pipeline to Detect Clinical Entities (bert_token_classifier_ner_clinical)
author: John Snow Labs
name: bert_token_classifier_ner_clinical_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, berfortokenclassification, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [bert_token_classifier_ner_clinical](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_clinical_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_pipeline_en_3.4.1_3.0_1647888696583.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_pipeline_en_3.4.1_3.0_1647888696583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("bert_token_classifier_ner_clinical_pipeline", "en", "clinical/models")
pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . 
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .")
```
```scala
val pipeline = new PretrainedPipeline("bert_token_classifier_ner_clinical_pipeline", "en", "clinical/models")
pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . 
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.clinical_pipeline").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . 
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .""")
```
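`annotate` returns a dictionary keyed by output column; reading it can be sketched in plain Python (the key and values below are mock stand-ins modeled on the results shown, not captured output):

```python
# Mock of the dictionary shape returned by annotate(); the key and values
# are illustrative stand-ins, not verified pipeline output.
result = {"ner_chunk": ["gestational diabetes mellitus", "amoxicillin"]}

# Collect the recognized entity chunks for the column of interest.
chunks = list(result["ner_chunk"])
print(chunks)
```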
## Results
```bash
+-----------------------------+---------+
|chunk |ner_label|
+-----------------------------+---------+
|gestational diabetes mellitus|PROBLEM |
|type two diabetes mellitus |PROBLEM |
|T2DM |PROBLEM |
|HTG-induced pancreatitis |PROBLEM |
|an acute hepatitis |PROBLEM |
|obesity |PROBLEM |
|a body mass index |TEST |
|BMI |TEST |
|polyuria |PROBLEM |
|polydipsia |PROBLEM |
|poor appetite |PROBLEM |
|vomiting |PROBLEM |
|amoxicillin |TREATMENT|
|a respiratory tract infection|PROBLEM |
|metformin |TREATMENT|
|glipizide |TREATMENT|
|dapagliflozin |TREATMENT|
|T2DM |PROBLEM |
|atorvastatin |TREATMENT|
|gemfibrozil |TREATMENT|
+-----------------------------+---------+
only showing top 20 rows
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.8 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverter
---
layout: model
title: English BertForQuestionAnswering model (from krinal214)
author: John Snow Labs
name: bert_qa_augmented_Squad_Translated
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `augmented_Squad_Translated` is an English model originally trained by `krinal214`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_augmented_Squad_Translated_en_4.0.0_3.0_1654179259638.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_augmented_Squad_Translated_en_4.0.0_3.0_1654179259638.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_augmented_Squad_Translated","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_augmented_Squad_Translated","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.augmented").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_augmented_Squad_Translated|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/krinal214/augmented_Squad_Translated
---
layout: model
title: Arabic BertForMaskedLM Base Cased model (from aubmindlab)
author: John Snow Labs
name: bert_embeddings_base_arabertv02
date: 2022-12-02
tags: [ar, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: ar
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-arabertv02` is an Arabic model originally trained by `aubmindlab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabertv02_ar_4.2.4_3.0_1670015827071.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabertv02_ar_4.2.4_3.0_1670015827071.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabertv02","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabertv02","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_arabertv02|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|507.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/aubmindlab/bert-base-arabertv02
- https://github.com/google-research/bert
- https://arxiv.org/abs/2003.00104
- https://github.com/WissamAntoun/pydata_khobar_meetup
- http://alt.qcri.org/farasa/segmenter.html
- https://github.com/google-research/bert/blob/master/multilingual.md
- https://github.com/elnagara/HARD-Arabic-Dataset
- https://www.aclweb.org/anthology/D15-1299
- https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf
- https://github.com/mohamedadaly/LABR
- http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp
- https://github.com/husseinmozannar/SOQAL
- https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md
- https://arxiv.org/abs/2003.00104v2
- https://archive.org/details/arwiki-20190201
- https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4
- https://www.aclweb.org/anthology/W19-4619
- https://sites.aub.edu.lb/mindlab/
- https://www.yakshof.com/#/
- https://www.behance.net/rahalhabib
- https://www.linkedin.com/in/wissam-antoun-622142b4/
- https://twitter.com/wissam_antoun
- https://github.com/WissamAntoun
- https://www.linkedin.com/in/fadybaly/
- https://twitter.com/fadybaly
- https://github.com/fadybaly
---
layout: model
title: English BertForQuestionAnswering Cased model (from DaisyMak)
author: John Snow Labs
name: bert_qa_finetuned_squad_transformerfrozen_testtoken
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-transformerfrozen-testtoken` is an English model originally trained by `DaisyMak`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_finetuned_squad_transformerfrozen_testtoken_en_4.0.0_3.0_1657187107539.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_finetuned_squad_transformerfrozen_testtoken_en_4.0.0_3.0_1657187107539.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_finetuned_squad_transformerfrozen_testtoken","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_finetuned_squad_transformerfrozen_testtoken","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_finetuned_squad_transformerfrozen_testtoken|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/DaisyMak/bert-finetuned-squad-transformerfrozen-testtoken
---
layout: model
title: Arabic Named Entity Recognition (Dialectal Arabic-DA)
author: John Snow Labs
name: bert_ner_bert_base_arabic_camelbert_da_ner
date: 2022-05-04
tags: [bert, ner, token_classification, ar, open_source]
task: Named Entity Recognition
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-da-ner` is an Arabic model originally trained by `CAMeL-Lab`.
## Predicted Entities
`ORG`, `LOC`, `PERS`, `MISC`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_arabic_camelbert_da_ner_ar_3.4.2_3.0_1651630269156.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_arabic_camelbert_da_ner_ar_3.4.2_3.0_1651630269156.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_arabic_camelbert_da_ner","ar") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_arabic_camelbert_da_ner","ar")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("أنا أحب الشرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.ner.arabic_camelbert_da_ner").predict("""أنا أحب الشرارة NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_bert_base_arabic_camelbert_da_ner|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|ar|
|Size:|407.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-da-ner
- https://camel.abudhabi.nyu.edu/anercorp/
- https://arxiv.org/abs/2103.06678
- https://github.com/CAMeL-Lab/CAMeLBERT
- https://github.com/CAMeL-Lab/camel_tools
---
layout: model
title: Part of Speech for Hebrew
author: John Snow Labs
name: pos_ud_htb
date: 2020-12-09
task: Part of Speech Tagging
language: he
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [pos, open_source, he]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_htb_he_2.7.0_2.4_1607521333296.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_htb_he_2.7.0_2.4_1607521333296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an nlp pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
pos = PerceptronModel.pretrained("pos_ud_htb", "he") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate(["""ב- 25 לאוגוסט עצר השב"כ את מוחמד אבו-ג'וייד , אזרח ירדני , שגויס לארגון הפת"ח והופעל על ידי חיזבאללה"""])
```
```scala
...
val pos = PerceptronModel.pretrained("pos_ud_htb", "he")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("""ב- 25 לאוגוסט עצר השב"כ את מוחמד אבו-ג'וייד , אזרח ירדני , שגויס לארגון הפת"ח והופעל על ידי חיזבאללה""").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""ב- 25 לאוגוסט עצר השב"כ את מוחמד אבו-ג'וייד , אזרח ירדני , שגויס לארגון הפת"ח והופעל על ידי חיזבאללה"""]
pos_df = nlu.load('he.pos.ud_htb').predict(text, output_level='token')
pos_df
```
## Results
```bash
{'pos': [Annotation(pos, 0, 0, ADP, {'word': 'ב'}),
Annotation(pos, 1, 1, PUNCT, {'word': '-'}),
Annotation(pos, 3, 4, NUM, {'word': '25'}),
Annotation(pos, 6, 12, VERB, {'word': 'לאוגוסט'}),
Annotation(pos, 14, 16, None, {'word': 'עצר'}),
Annotation(pos, 18, 22, VERB, {'word': 'השב"כ'}),
Annotation(pos, 24, 25, ADP, {'word': 'את'}),
Annotation(pos, 27, 31, PROPN, {'word': 'מוחמד'}),
Annotation(pos, 33, 42, PROPN, {'word': "אבו-ג'וייד"}),
Annotation(pos, 44, 44, PUNCT, {'word': ','}),
Annotation(pos, 46, 49, NOUN, {'word': 'אזרח'}),
Annotation(pos, 51, 55, ADJ, {'word': 'ירדני'}),
Annotation(pos, 57, 57, PUNCT, {'word': ','}),
Annotation(pos, 59, 63, VERB, {'word': 'שגויס'}),
Annotation(pos, 65, 70, ADP, {'word': 'לארגון'}),
Annotation(pos, 72, 76, NOUN, {'word': 'הפת"ח'}),
Annotation(pos, 78, 83, PROPN, {'word': 'והופעל'}),
Annotation(pos, 85, 86, ADP, {'word': 'על'}),
Annotation(pos, 88, 90, NOUN, {'word': 'ידי'}),
Annotation(pos, 92, 99, PROPN, {'word': 'חיזבאללה'})]}
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_htb|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[token, document]|
|Output Labels:|[pos]|
|Language:|he|
## Data Source
The model is trained on data obtained from [https://universaldependencies.org](https://universaldependencies.org)
## Benchmarking
```bash
| | | precision | recall | f1-score | support |
|---:|:-------------|:------------|:---------|-----------:|----------:|
| 0 | ADJ | 0.83 | 0.83 | 0.83 | 676 |
| 1 | ADP | 0.99 | 0.99 | 0.99 | 1889 |
| 2 | ADV | 0.93 | 0.89 | 0.91 | 408 |
| 3 | AUX | 0.90 | 0.90 | 0.9 | 229 |
| 4 | CCONJ | 0.97 | 0.99 | 0.98 | 434 |
| 5 | DET | 0.97 | 0.99 | 0.98 | 1390 |
| 6 | NOUN | 0.91 | 0.94 | 0.93 | 3056 |
| 7 | NUM | 0.97 | 0.96 | 0.97 | 285 |
| 9 | PRON | 0.97 | 0.99 | 0.98 | 443 |
| 10 | PROPN | 0.82 | 0.72 | 0.77 | 573 |
| 11 | PUNCT | 1.00 | 1.00 | 1 | 1381 |
| 12 | SCONJ | 0.99 | 0.90 | 0.94 | 411 |
| 13 | VERB | 0.87 | 0.85 | 0.86 | 1063 |
| 14 | X | 1.00 | 0.17 | 0.29 | 6 |
| 15 | accuracy | | | 0.95 | 15089 |
| 16 | macro avg | 0.94 | 0.87 | 0.89 | 15089 |
| 17 | weighted avg | 0.95 | 0.95 | 0.95 | 15089 |
```
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab53_by_hassnain TFWav2Vec2ForCTC from hassnain
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab53_by_hassnain
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab53_by_hassnain` is an English model originally trained by hassnain.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab53_by_hassnain_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab53_by_hassnain_en_4.2.0_3.0_1664022612371.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab53_by_hassnain_en_4.2.0_3.0_1664022612371.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_timit_demo_colab53_by_hassnain", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_timit_demo_colab53_by_hassnain", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab53_by_hassnain|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|
---
layout: model
title: Legal Third party beneficiaries Clause Binary Classifier
author: John Snow Labs
name: legclf_third_party_beneficiaries_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `third-party-beneficiaries` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do binary classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `third-party-beneficiaries`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_third_party_beneficiaries_clause_en_1.0.0_3.2_1660123110554.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_third_party_beneficiaries_clause_en_1.0.0_3.2_1660123110554.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
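This card did not include a usage snippet. Below is a sketch of the typical pipeline for these legal clause classifiers, assuming the johnsnowlabs `nlp`/`legal` modules and the `sent_bert_base_cased` sentence embeddings that are commonly paired with `sentence_embeddings`-based clause classifiers; verify the embeddings model against your Models Hub listing before use.

```python
# Hypothetical usage sketch for legclf_third_party_beneficiaries_clause.
# Assumes a Spark session and a Legal NLP license are available.
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence embeddings feeding the classifier's `sentence_embeddings` input.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_third_party_beneficiaries_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    embeddings,
    docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = nlpPipeline.fit(df).transform(df)
```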
## Results
```bash
+---------------------------+
|                     result|
+---------------------------+
|[third-party-beneficiaries]|
|                    [other]|
|                    [other]|
|[third-party-beneficiaries]|
+---------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_third_party_beneficiaries_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
no-third-party-beneficiaries 0.96 0.96 0.96 49
other 0.98 0.98 0.98 130
accuracy - - 0.98 179
macro-avg 0.97 0.97 0.97 179
weighted-avg 0.98 0.98 0.98 179
```
---
layout: model
title: Mapping Drug Brand Names with Corresponding National Drug Codes
author: John Snow Labs
name: drug_brandname_ndc_mapper
date: 2022-06-26
tags: [chunk_mapper, ndc, clinical, licensed, en]
task: Chunk Mapping
language: en
nav_key: models
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained model maps drug brand names to corresponding National Drug Codes (NDC). Product NDCs for each strength are returned in the result and metadata.
## Predicted Entities
`Strength_NDC`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/drug_brandname_ndc_mapper_en_3.5.3_3.0_1656260706121.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/drug_brandname_ndc_mapper_en_3.5.3_3.0_1656260706121.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("chunk")
chunkerMapper = ChunkMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models")\
.setInputCols(["chunk"])\
.setOutputCol("ndc")\
.setRels(["Strength_NDC"])\
.setLowerCase(True)
pipeline = Pipeline().setStages([
document_assembler,
chunkerMapper])
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_pipeline = LightPipeline(model)
result = light_pipeline.fullAnnotate(["zytiga", "zyvana", "ZYVOX"])
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("chunk")
val chunkerMapper = ChunkMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models")
.setInputCols(Array("chunk"))
.setOutputCol("ndc")
.setRels(Array("Strength_NDC"))
.setLowerCase(true)
val pipeline = new Pipeline().setStages(Array(
document_assembler,
chunkerMapper))
val sample_data = Seq("zytiga", "zyvana", "ZYVOX").toDS.toDF("text")
val result = pipeline.fit(sample_data).transform(sample_data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.drug_brand_to_ndc").predict("""Put your text here.""")
```
## Results
```bash
| | Brandname | Strength_NDC |
|---:|:------------|:-------------------------|
| 0 | zytiga | 500 mg/1 | 57894-195 |
| 1 | zyvana | 527 mg/1 | 69336-405 |
| 2 | ZYVOX | 600 mg/300mL | 0009-4992 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|drug_brandname_ndc_mapper|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|3.0 MB|
---
layout: model
title: Fast Neural Machine Translation Model from English to Luba-Katanga
author: John Snow Labs
name: opus_mt_en_lu
date: 2020-12-29
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, lu, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `en`
- target languages: `lu`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_lu_xx_2.7.0_2.4_1609281752399.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_lu_xx_2.7.0_2.4_1609281752399.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_lu", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_lu", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.lu').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_lu|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Fast Neural Machine Translation Model from Amharic to Swedish
author: John Snow Labs
name: opus_mt_am_sv
date: 2021-06-01
tags: [open_source, seq2seq, translation, am, sv, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `am`
- target languages: `sv`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_am_sv_xx_3.1.0_2.4_1622554609708.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_am_sv_xx_3.1.0_2.4_1622554609708.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_am_sv", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_am_sv", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Amharic.translate_to.Swedish').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_am_sv|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English asr_wav2vec2_large_xls_r_thai_test TFWav2Vec2ForCTC from juierror
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_thai_test
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_thai_test` is an English model originally trained by juierror.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_thai_test_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_thai_test_en_4.2.0_3.0_1664024211955.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_thai_test_en_4.2.0_3.0_1664024211955.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_thai_test', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_thai_test", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_thai_test|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: ESG Text Classification (3 classes)
author: John Snow Labs
name: finclf_esg
date: 2022-09-06
tags: [en, financial, esg, classification, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model classifies financial texts / news into three classes: Environmental, Social and Governance. It can be used to build an ESG scoreboard for companies.
If you are looking for an augmented version of this model, with more fine-grained verticals (Greenhouse Gas Emissions, Business Ethics, etc.), please look for the finance_sequence_classifier_augmented_esg model in Models Hub.
## Predicted Entities
`Environmental`, `Social`, `Governance`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/FINCLF_ESG/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_esg_en_1.0.0_3.2_1662472406140.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_esg_en_1.0.0_3.2_1662472406140.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = nlp.Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
sequenceClassifier = finance.BertForSequenceClassification.pretrained("finclf_esg", "en", "finance/models")\
.setInputCols(["document",'token'])\
.setOutputCol("class")
pipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
# a simple example
example = spark.createDataFrame([["""The Canadian Environmental Assessment Agency (CEAA) concluded that in June 2016 the company had not made an effort
to protect public drinking water and was ignoring concerns raised by its own scientists about the potential levels of pollutants in the local water supply.
At the time, there were concerns that the company was not fully testing onsite wells for contaminants and did not use the proper methods for testing because
of its test kits now manufactured in China. A preliminary report by the company in June 2016 was commissioned by the Alberta government to provide recommendations
to Alberta Environment officials"""]]).toDF("text")
result = pipeline.fit(example).transform(example)
# result is a DataFrame
result.select("text", "class.result").show()
```
## Results
```bash
+--------------------+---------------+
| text| result|
+--------------------+---------------+
|The Canadian Envi...|[Environmental]|
+--------------------+---------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finclf_esg|
|Type:|finance|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|412.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
In-house annotations from scrapped annual reports and tweets about ESG
## Benchmarking
```bash
label precision recall f1-score support
Environmental 0.99 0.97 0.98 97
Social 0.95 0.96 0.95 162
Governance 0.91 0.90 0.91 71
accuracy - - 0.95 330
macro-avg 0.95 0.94 0.95 330
weighted-avg 0.95 0.95 0.95 330
```
---
layout: model
title: Mapping Entities with Corresponding ICD-9-CM Codes
author: John Snow Labs
name: icd9_mapper
date: 2022-09-30
tags: [icd9cm, chunk_mapping, en, licensed, clinical]
task: Chunk Mapping
language: en
nav_key: models
edition: Healthcare NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained model maps entities with their corresponding ICD-9-CM codes.
## Predicted Entities
`icd9_code`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd9_mapper_en_4.1.0_3.0_1664535522949.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd9_mapper_en_4.1.0_3.0_1664535522949.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('doc')
chunk_assembler = Doc2Chunk()\
.setInputCols(['doc'])\
.setOutputCol('ner_chunk')
chunkerMapper = ChunkMapperModel\
.pretrained("icd9_mapper", "en", "clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")\
.setRels(["icd9_code"])
mapper_pipeline = Pipeline(stages=[
document_assembler,
chunk_assembler,
chunkerMapper
])
test_data = spark.createDataFrame([["24 completed weeks of gestation"]]).toDF("text")
result = mapper_pipeline.fit(test_data).transform(test_data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("doc")
val chunk_assembler = new Doc2Chunk()
.setInputCols(Array("doc"))
.setOutputCol("ner_chunk")
val chunkerMapper = ChunkMapperModel
.pretrained("icd9_mapper", "en", "clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("mappings")
.setRels(Array("icd9_code"))
val mapper_pipeline = new Pipeline().setStages(Array(
document_assembler,
chunk_assembler,
chunkerMapper))
val test_data = Seq("24 completed weeks of gestation").toDS.toDF("text")
val result = mapper_pipeline.fit(test_data).transform(test_data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.icd9").predict("""24 completed weeks of gestation""")
```
## Results
```bash
+-------------------------------+------------+
|chunk |icd9_mapping|
+-------------------------------+------------+
|24 completed weeks of gestation|765.22 |
+-------------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|icd9_mapper|
|Compatibility:|Healthcare NLP 4.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|374.4 KB|
---
layout: model
title: Entity Recognizer LG
author: John Snow Labs
name: entity_recognizer_lg
date: 2022-06-28
tags: [fi, open_source]
task: Named Entity Recognition
language: fi
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The entity_recognizer_lg is a pretrained pipeline that performs basic processing steps and recognizes entities.
It performs most of the common text processing tasks on your DataFrame.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_fi_4.0.0_3.0_1656386552101.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_fi_4.0.0_3.0_1656386552101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("entity_recognizer_lg", "fi")
result = pipeline.annotate("""I love johnsnowlabs! """)
```
{:.nlu-block}
```python
import nlu
nlu.load("fi.ner.lg").predict("""I love johnsnowlabs! """)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|entity_recognizer_lg|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|fi|
|Size:|2.5 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- NerDLModel
- NerConverter
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18)
author: John Snow Labs
name: distilbert_qa_base_uncased_becasv2_5
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becasv2-5` is an English model originally trained by `Evelyn18`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_5_en_4.3.0_3.0_1672767788818.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_5_en_4.3.0_3.0_1672767788818.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_5","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_5","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_becasv2_5|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Evelyn18/distilbert-base-uncased-becasv2-5
---
layout: model
title: English DistilBertForQuestionAnswering model (from ncduy)
author: John Snow Labs
name: distilbert_qa_base_cased_distilled_squad_finetuned_squad_test
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-finetuned-squad-test` is an English model originally trained by `ncduy`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_distilled_squad_finetuned_squad_test_en_4.0.0_3.0_1654723644873.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_distilled_squad_finetuned_squad_test_en_4.0.0_3.0_1654723644873.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_distilled_squad_finetuned_squad_test","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_distilled_squad_finetuned_squad_test","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_cased").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_cased_distilled_squad_finetuned_squad_test|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ncduy/distilbert-base-cased-distilled-squad-finetuned-squad-test
---
layout: model
title: Extract Demographic Entities from Oncology Texts
author: John Snow Labs
name: ner_oncology_demographics_wip
date: 2022-09-30
tags: [licensed, clinical, oncology, en, ner, demographics]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts demographic information from oncology texts, including age, gender, race/ethnicity and smoking status.
Definitions of Predicted Entities:
- `Age`: All mentions of age, past or present, related to the patient or to anybody else.
- `Gender`: Gender-specific nouns and pronouns (including words such as "him" or "she", and family members such as "father").
- `Race_Ethnicity`: The race and ethnicity categories include racial and national origin or sociocultural groups.
- `Smoking_Status`: All mentions of smoking related to the patient or to someone else.
## Predicted Entities
`Age`, `Gender`, `Race_Ethnicity`, `Smoking_Status`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_demographics_wip_en_4.0.0_3.0_1664563557899.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_demographics_wip_en_4.0.0_3.0_1664563557899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")\
.setSplitChars(['-'])
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_demographics_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["The patient is a 40-year-old man with history of heavy smoking."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
.setSplitChars("-")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_demographics_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("The patient is a 40-year-old man with history of heavy smoking.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_demographics_wip").predict("""The patient is a 40-year-old man with history of heavy smoking.""")
```
## Results
```bash
| chunk | ner_label |
|:------------|:---------------|
| 40-year-old | Age |
| man | Gender |
| smoking | Smoking_Status |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_demographics_wip|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|849.2 KB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Smoking_Status 43.0 12.0 11.0 54.0 0.78 0.80 0.79
Age 679.0 27.0 17.0 696.0 0.96 0.98 0.97
Race_Ethnicity 44.0 7.0 7.0 51.0 0.86 0.86 0.86
Gender 933.0 14.0 8.0 941.0 0.99 0.99 0.99
macro_avg 1699.0 60.0 43.0 1742.0 0.90 0.91 0.90
micro_avg NaN NaN NaN NaN 0.97 0.98 0.97
```
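The per-label scores above follow the standard precision/recall/F1 definitions. As a standalone sanity check (a sketch for illustration, not part of the model card tooling), they can be recomputed from the tp/fp/fn counts:

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Age row from the benchmark table: tp=679, fp=27, fn=17
p, r, f = prf1(679, 27, 17)
# rounds to precision 0.96, recall 0.98, f1 0.97, matching the table
```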
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from doc2query)
author: John Snow Labs
name: t5_yahoo_answers_base_v1
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `yahoo_answers-t5-base-v1` is an English model originally trained by `doc2query`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_yahoo_answers_base_v1_en_4.3.0_3.0_1675158667385.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_yahoo_answers_base_v1_en_4.3.0_3.0_1675158667385.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_yahoo_answers_base_v1","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_yahoo_answers_base_v1","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_yahoo_answers_base_v1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|1.0 GB|
## References
- https://huggingface.co/doc2query/yahoo_answers-t5-base-v1
- https://arxiv.org/abs/1904.08375
- https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf
- https://arxiv.org/abs/2104.08663
- https://github.com/UKPLab/beir
- https://www.sbert.net/examples/unsupervised_learning/query_generation/README.html
---
layout: model
title: English BertForQuestionAnswering model (from srmukundb)
author: John Snow Labs
name: bert_qa_srmukundb_bert_base_uncased_finetuned_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-squad` is an English model originally trained by `srmukundb`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_srmukundb_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181131257.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_srmukundb_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181131257.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_srmukundb_bert_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_srmukundb_bert_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base_uncased.by_srmukundb").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_srmukundb_bert_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/srmukundb/bert-base-uncased-finetuned-squad
---
layout: model
title: English DistilBertForTokenClassification Cased model (from ml6team)
author: John Snow Labs
name: distilbert_token_classifier_keyphrase_extraction_inspec
date: 2023-03-06
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-inspec` is an English model originally trained by `ml6team`.
## Predicted Entities
`KEY`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_inspec_en_4.3.1_3.0_1678133894118.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_inspec_en_4.3.1_3.0_1678133894118.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_inspec","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_inspec","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_keyphrase_extraction_inspec|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ml6team/keyphrase-extraction-distilbert-inspec
- https://dl.acm.org/doi/10.3115/1119355.1119383
- https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=inspec
---
layout: model
title: Word2Vec Embeddings in Azerbaijani (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-14
tags: [cc, embeddings, fastText, word2vec, az, open_source]
task: Embeddings
language: az
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_az_3.4.1_3.0_1647284820457.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_az_3.4.1_3.0_1647284820457.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","az") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Qığılcım nlp sevirəm"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","az")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Qığılcım nlp sevirəm").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("az.embed.w2v_cc_300d").predict("""Qığılcım nlp sevirəm""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|az|
|Size:|1.2 GB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Fast Neural Machine Translation Model from Bemba (Zambia) to Spanish
author: John Snow Labs
name: opus_mt_bem_es
date: 2021-06-01
tags: [open_source, seq2seq, translation, bem, es, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
source languages: bem
target languages: es
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bem_es_xx_3.1.0_2.4_1622560268464.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bem_es_xx_3.1.0_2.4_1622560268464.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_bem_es", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
data = spark.createDataFrame([["text to translate"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_bem_es", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Bemba (Zambia).translate_to.Spanish').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_bem_es|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForQuestionAnswering Mini Uncased model (from ahujaniharika95)
author: John Snow Labs
name: bert_qa_ahujaniharika95_minilm_uncased_squad2_finetuned_squad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `minilm-uncased-squad2-finetuned-squad` is an English model originally trained by `ahujaniharika95`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_ahujaniharika95_minilm_uncased_squad2_finetuned_squad_en_4.0.0_3.0_1657190376033.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_ahujaniharika95_minilm_uncased_squad2_finetuned_squad_en_4.0.0_3.0_1657190376033.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ahujaniharika95_minilm_uncased_squad2_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ahujaniharika95_minilm_uncased_squad2_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_ahujaniharika95_minilm_uncased_squad2_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|124.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/ahujaniharika95/minilm-uncased-squad2-finetuned-squad
---
layout: model
title: Fast Neural Machine Translation Model from English to Hiligaynon
author: John Snow Labs
name: opus_mt_en_hil
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, hil, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `en`
- target languages: `hil`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_hil_xx_2.7.0_2.4_1609168441233.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_hil_xx_2.7.0_2.4_1609168441233.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_hil", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_hil", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.hil').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_hil|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Multilingual (English, German) DistilBertForQuestionAnswering model (from ZYW)
author: John Snow Labs
name: distilbert_qa_en_de_model
date: 2022-06-08
tags: [en, de, open_source, distilbert, question_answering, xx]
task: Question Answering
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `en-de-model` is a multilingual (English, German) model originally trained by `ZYW`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_en_de_model_xx_4.0.0_3.0_1654728267702.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_en_de_model_xx_4.0.0_3.0_1654728267702.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_en_de_model","xx") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_en_de_model","xx")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("xx.answer_question.distil_bert.en_de_tuned.by_ZYW").predict("""PUT YOUR QUESTION HERE|||PUT YOUR CONTEXT HERE""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_en_de_model|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|xx|
|Size:|505.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ZYW/en-de-model
---
layout: model
title: Legal Registration expenses Clause Binary Classifier
author: John Snow Labs
name: legclf_registration_expenses_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `registration-expenses` clause type. To use it, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences instead of the whole text, so it is better to skip them unless you want to perform Binary Classification at sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you have added.
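The splitting step above can be sketched with a naive whitespace chunker (illustration only: the model's real tokenizer is subword-based, so actual token counts will differ; `chunk_text` is a hypothetical helper, not part of the library):

```python
def chunk_text(text, max_tokens=512):
    """Split text into pieces of at most max_tokens whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

# A 1200-token document becomes three pieces of 512, 512 and 176 tokens.
pieces = chunk_text("word " * 1200, max_tokens=512)
```

Each piece can then be sent through the classification pipeline independently.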
## Predicted Entities
`other`, `registration-expenses`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_registration_expenses_clause_en_1.0.0_3.2_1660122893840.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_registration_expenses_clause_en_1.0.0_3.2_1660122893840.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
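This card does not ship a code snippet; a minimal sketch following the pattern of sibling legal clause classifier cards is shown below. The sentence-embedding stage (`sent_bert_base_cased`) is an assumption, not confirmed by this card, and running it requires the `johnsnowlabs` library with a Legal NLP license and an active `spark` session.

```python
# Sketch only: assumes the johnsnowlabs library and a Legal NLP license.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")

# Embedding model is an assumption; check the Models Hub for the exact stage.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
.setInputCols("document")\
.setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_registration_expenses_clause", "en", "legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])
data = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```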
## Results
```bash
+-----------------------+
|                 result|
+-----------------------+
|[registration-expenses]|
|                [other]|
|                [other]|
|[registration-expenses]|
+-----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_registration_expenses_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet and classified in-house.
## Benchmarking
```bash
label precision recall f1-score support
other 1.00 0.99 0.99 86
registration-expenses 0.97 1.00 0.98 30
accuracy - - 0.99 116
macro-avg 0.98 0.99 0.99 116
weighted-avg 0.99 0.99 0.99 116
```
---
layout: model
title: Legal Subscription Agreement Document Classifier (Longformer)
author: John Snow Labs
name: legclf_subscription_agreement
date: 2022-11-10
tags: [en, legal, classification, agreement, subscription, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_subscription_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `subscription-agreement` or not (Binary Classification).
Longformers are limited to 4096 tokens, so only the first 4096 tokens of a document are taken into account. We have found that, for the vast majority of documents in legal corpora — provided they are clean and contain only the legal document itself, without extra leading material — 4096 tokens are enough for Document Classification.
If not, let us know and we can apply another approach for you: splitting the document into chunks of 4096 tokens, averaging their embeddings, and training on the averaged version, which means the whole document is taken into account. In theory, however, this should not be required.
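The chunk-and-average fallback described above can be sketched outside Spark NLP in plain NumPy; the `embed` function below is a hypothetical stand-in for the Longformer encoder:

```python
import numpy as np

def average_chunk_embedding(token_ids, embed, chunk_size=4096):
    """Split a long token sequence into chunks the encoder can handle,
    embed each chunk, and average the per-chunk document vectors."""
    chunks = [token_ids[i:i + chunk_size]
              for i in range(0, len(token_ids), chunk_size)]
    # embed(chunk) -> 1-D document vector for that chunk
    vectors = np.stack([embed(chunk) for chunk in chunks])
    return vectors.mean(axis=0)

# Toy stand-in encoder: the "embedding" is just [mean token id, chunk length]
toy_embed = lambda chunk: np.array([np.mean(chunk), float(len(chunk))])
doc = list(range(10000))          # a "document" longer than 4096 tokens
vec = average_chunk_embedding(doc, toy_embed)  # covers all 10000 tokens
```

The averaged vector then replaces the single truncated embedding when training the classifier.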
## Predicted Entities
`subscription-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_subscription_agreement_en_1.0.0_3.0_1668111662925.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_subscription_agreement_en_1.0.0_3.0_1668111662925.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
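No usage snippet was included for this classifier; the sketch below follows the pattern of the other Legal NLP Longformer document classifiers on this hub. The embeddings model name (`legal_longformer_base`) is an assumption — substitute the Longformer embeddings this classifier was actually trained with.

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Assumed Longformer embeddings stage, averaged into document-level vectors
embeddings = LongformerEmbeddings.pretrained("legal_longformer_base", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

doc_classifier = LegalClassifierDLModel.pretrained("legclf_subscription_agreement", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings,
                            sentence_embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```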
## Results
```bash
+-------+
| result|
+-------+
|[subscription-agreement]|
|[other]|
|[other]|
|[subscription-agreement]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_subscription_agreement|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.6 MB|
## References
Legal documents, scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
other 0.99 0.98 0.98 85
subscription-agreement 0.94 0.97 0.95 30
accuracy - - 0.97 115
macro-avg 0.96 0.97 0.97 115
weighted-avg 0.97 0.97 0.97 115
```
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from hjds0923)
author: John Snow Labs
name: distilbert_qa_hjds0923_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `hjds0923`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hjds0923_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771214311.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hjds0923_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771214311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hjds0923_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hjds0923_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_hjds0923_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/hjds0923/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Lemmatizer (French, SpacyLookup)
author: John Snow Labs
name: lemma_spacylookup
date: 2022-03-03
tags: [open_source, lemmatizer, fr]
task: Lemmatization
language: fr
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
annotator: LemmatizerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This French Lemmatizer is a scalable, production-ready version of the rule-based lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_fr_3.4.1_3.0_1646316584024.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_fr_3.4.1_3.0_1646316584024.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","fr") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer])
example = spark.createDataFrame([["Tu n'es pas mieux que moi"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","fr")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer))
val data = Seq("Tu n'es pas mieux que moi").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("fr.lemma.spacylookup").predict("""Tu n'es pas mieux que moi""")
```
## Results
```bash
+--------------------------------+
|result |
+--------------------------------+
|[Tu, n'es, pas, mieux, que, moi]|
+--------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma_spacylookup|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|fr|
|Size:|2.4 MB|
---
layout: model
title: English asr_Part1 TFWav2Vec2ForCTC from zasheza
author: John Snow Labs
name: asr_Part1
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Part1` is an English model originally trained by zasheza.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_Part1_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Part1_en_4.2.0_3.0_1664039751422.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Part1_en_4.2.0_3.0_1664039751422.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_Part1", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_Part1", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_Part1|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|
---
layout: model
title: English BertForQuestionAnswering model (from juliusco)
author: John Snow Labs
name: bert_qa_biobert_base_cased_v1.1_squad_finetuned_covdrobert
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert-base-cased-v1.1-squad-finetuned-covdrobert` is an English model originally trained by `juliusco`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_base_cased_v1.1_squad_finetuned_covdrobert_en_4.0.0_3.0_1654185648282.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_base_cased_v1.1_squad_finetuned_covdrobert_en_4.0.0_3.0_1654185648282.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_base_cased_v1.1_squad_finetuned_covdrobert","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_biobert_base_cased_v1.1_squad_finetuned_covdrobert","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.covid_roberta.base_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_biobert_base_cased_v1.1_squad_finetuned_covdrobert|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|403.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/juliusco/biobert-base-cased-v1.1-squad-finetuned-covdrobert
---
layout: model
title: Fast Neural Machine Translation Model from Hungarian to English
author: John Snow Labs
name: opus_mt_hu_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, hu, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `hu`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_hu_en_xx_2.7.0_2.4_1609168984163.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_hu_en_xx_2.7.0_2.4_1609168984163.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_hu_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_hu_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.hu.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_hu_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: BioBERT Embeddings (Pubmed PMC)
author: John Snow Labs
name: biobert_pubmed_pmc_base_cased
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [embeddings, en, open_source]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)".
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_pubmed_pmc_base_cased_en_2.6.0_2.4_1598343200280.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_pubmed_pmc_base_cased_en_2.6.0_2.4_1598343200280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("biobert_pubmed_pmc_base_cased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("biobert_pubmed_pmc_base_cased", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I hate cancer").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer"]
embeddings_df = nlu.load('en.embed.biobert.pubmed_pmc_base_cased').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_biobert_pubmed_pmc_base_cased_embeddings
I [-0.012962102890014648, 0.27699071168899536, 0...
hate [0.1688309609889984, 0.5337603688240051, 0.148...
cancer [0.1850549429655075, 0.05875205248594284, -0.5...
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|biobert_pubmed_pmc_base_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|768|
|Case sensitive:|true|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert)
---
layout: model
title: German Electra Embeddings (from stefan-it)
author: John Snow Labs
name: electra_embeddings_electra_base_gc4_64k_800000_cased_generator
date: 2022-05-17
tags: [de, open_source, electra, embeddings]
task: Embeddings
language: de
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-gc4-64k-800000-cased-generator` is a German model originally trained by `stefan-it`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_800000_cased_generator_de_3.4.4_3.0_1652786505011.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_800000_cased_generator_de_3.4.4_3.0_1652786505011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_800000_cased_generator","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_800000_cased_generator","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ich liebe Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_embeddings_electra_base_gc4_64k_800000_cased_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|de|
|Size:|222.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/stefan-it/electra-base-gc4-64k-800000-cased-generator
- https://german-nlp-group.github.io/projects/gc4-corpus.html
- https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf
---
layout: model
title: English Bert Embeddings (from nlp4good)
author: John Snow Labs
name: bert_embeddings_psych_search
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `psych-search` is an English model originally trained by `nlp4good`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_psych_search_en_3.4.2_3.0_1649672127720.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_psych_search_en_3.4.2_3.0_1649672127720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_psych_search","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_psych_search","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.psych_search").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_psych_search|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|412.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/nlp4good/psych-search
- https://meshb.nlm.nih.gov/treeView
- https://meshb.nlm.nih.gov/record/ui?ui=D000072339
- https://meshb.nlm.nih.gov/record/ui?ui=D005006
- https://meshb.nlm.nih.gov/treeView
- http://bioasq.org/
---
layout: model
title: English asr_wav2vec2_large_english TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_english
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_english` is an English model originally trained by jonatasgrosman.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use `pipeline_asr_wav2vec2_large_english_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_english_en_4.2.0_3.0_1664020317451.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_english_en_4.2.0_3.0_1664020317451.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_english', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_english", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_english|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Word2Vec Embeddings in Norwegian (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: ["no", open_source]
task: Embeddings
language: "no"
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_no_3.4.1_3.0_1647448666485.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_no_3.4.1_3.0_1647448666485.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","no") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Jeg elsker gnist nlp"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","no")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Jeg elsker gnist nlp").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("no.embed.w2v_cc_300d").predict("""Jeg elsker gnist nlp""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|no|
|Size:|1.2 GB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Extract Cancer Therapies and Granular Posology Information
author: John Snow Labs
name: ner_oncology_posology
date: 2022-10-25
tags: [licensed, clinical, oncology, en, ner, treatment, posology]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP for Healthcare 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts cancer therapies (Cancer_Surgery, Radiotherapy and Cancer_Therapy) and posology information at a granular level.
Definitions of Predicted Entities:
- `Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment.
- `Cancer_Therapy`: Any cancer treatment mentioned in text, excluding surgeries and radiotherapy.
- `Cycle_Count`: The total number of cycles being administered of an oncological therapy (e.g. "5 cycles").
- `Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. "day 5").
- `Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. "third cycle").
- `Dosage`: The quantity prescribed by the physician for an active ingredient.
- `Duration`: Words indicating the duration of a treatment (e.g. "for 2 weeks").
- `Frequency`: Words indicating the frequency of treatment administration (e.g. "daily" or "bid").
- `Radiotherapy`: Terms that indicate the use of Radiotherapy.
- `Radiation_Dose`: Dose used in radiotherapy.
- `Route`: Words indicating the type of administration route (such as "PO" or "transdermal").
## Predicted Entities
`Cancer_Surgery`, `Cancer_Therapy`, `Cycle_Count`, `Cycle_Day`, `Cycle_Number`, `Dosage`, `Duration`, `Frequency`, `Radiotherapy`, `Radiation_Dose`, `Route`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_posology_en_4.0.0_3.0_1666728701834.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_posology_en_4.0.0_3.0_1666728701834.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_posology", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving his second cycle of chemotherapy and is in good overall condition."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_posology", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving his second cycle of chemotherapy and is in good overall condition.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_posology").predict("""The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving his second cycle of chemotherapy and is in good overall condition.""")
```
## Results
```bash
| chunk | ner_label |
|:-----------------|:---------------|
| adriamycin | Cancer_Therapy |
| 60 mg/m2 | Dosage |
| cyclophosphamide | Cancer_Therapy |
| 600 mg/m2 | Dosage |
| six courses | Cycle_Count |
| second cycle | Cycle_Number |
| chemotherapy | Cancer_Therapy |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_posology|
|Compatibility:|Spark NLP for Healthcare 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|34.3 MB|
|Dependencies:|embeddings_clinical|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Cycle_Number 52 4 45 97 0.93 0.54 0.68
Cycle_Count 200 63 30 230 0.76 0.87 0.81
Radiotherapy 255 16 55 310 0.94 0.82 0.88
Cancer_Surgery 592 66 227 819 0.90 0.72 0.80
Cycle_Day 175 22 73 248 0.89 0.71 0.79
Frequency 337 44 90 427 0.88 0.79 0.83
Route 53 1 60 113 0.98 0.47 0.63
Cancer_Therapy 1448 81 250 1698 0.95 0.85 0.90
Duration 525 154 236 761 0.77 0.69 0.73
Dosage 858 79 202 1060 0.92 0.81 0.86
Radiation_Dose 86 4 40 126 0.96 0.68 0.80
macro_avg 4581 534 1308 5889 0.90 0.72 0.79
micro_avg 4581 534 1308 5889 0.90 0.78 0.83
```
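The precision, recall, and F1 columns above follow the standard definitions computed from the tp/fp/fn counts. A minimal sketch reproducing the `Cancer_Therapy` row:

```python
# Standard precision/recall/F1 from true-positive, false-positive,
# and false-negative counts, using the Cancer_Therapy row above.
tp, fp, fn = 1448, 81, 250

precision = tp / (tp + fp)  # 1448 / 1529
recall = tp / (tp + fn)     # 1448 / 1698
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # → 0.95 0.85 0.9
```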
---
layout: model
title: English asr_english_filipino_wav2vec2_l_xls_r_test_09 TFWav2Vec2ForCTC from Khalsuu
author: John Snow Labs
name: asr_english_filipino_wav2vec2_l_xls_r_test_09
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_english_filipino_wav2vec2_l_xls_r_test_09` is a English model originally trained by Khalsuu.
NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_english_filipino_wav2vec2_l_xls_r_test_09_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_english_filipino_wav2vec2_l_xls_r_test_09_en_4.2.0_3.0_1664119314205.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_english_filipino_wav2vec2_l_xls_r_test_09_en_4.2.0_3.0_1664119314205.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_english_filipino_wav2vec2_l_xls_r_test_09", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_english_filipino_wav2vec2_l_xls_r_test_09", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_english_filipino_wav2vec2_l_xls_r_test_09|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Detect Sentences in Healthcare Texts
author: John Snow Labs
name: sentence_detector_dl_healthcare
date: 2021-08-11
tags: [licensed, clinical, en, sentence_detection]
task: Sentence Detection
language: en
nav_key: models
edition: Healthcare NLP 3.2.0
spark_version: 3.0
supported: true
annotator: SentenceDetectorDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation.
## Predicted Entities
Splits text into sentences.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/SENTENCE_DETECTOR_HC/){:.button.button-orange.button-orange-trans.arr.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/20.SentenceDetectorDL_Healthcare.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sentence_detector_dl_healthcare_en_3.2.0_3.0_1628678815210.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sentence_detector_dl_healthcare_en_3.2.0_3.0_1628678815210.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel\
.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentences")
text = """He was given boluses of MS04 with some effect, he has since been placed on a PCA - he take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, he has also received ativan for anxiety.Repleted with 20 meq kcl po, 30 mmol K-phos iv and 2 gms mag so4 iv.
Size: Prostate gland measures 10x1.1x 4.9 cm (LS x AP x TS). Estimated volume is
51.9 ml. , and is mildly enlarged in size.Normal delineation pattern of the prostate gland is preserved.
"""
sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL]))
result = sd_model.fullAnnotate(text)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val pipeline = new Pipeline().setStages(Array(documenter, model))
val data = Seq("""He was given boluses of MS04 with some effect, he has since been placed on a PCA - he take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, he has also received ativan for anxiety.Repleted with 20 meq kcl po, 30 mmol K-phos iv and 2 gms mag so4 iv.
Size: Prostate gland measures 10x1.1x 4.9 cm (LS x AP x TS). Estimated volume is
51.9 ml. , and is mildly enlarged in size.Normal delineation pattern of the prostate gland is preserved.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.detect_sentence.clinical").predict("""He was given boluses of MS04 with some effect, he has since been placed on a PCA - he take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, he has also received ativan for anxiety.Repleted with 20 meq kcl po, 30 mmol K-phos iv and 2 gms mag so4 iv.
Size: Prostate gland measures 10x1.1x 4.9 cm (LS x AP x TS). Estimated volume is
51.9 ml. , and is mildly enlarged in size.Normal delineation pattern of the prostate gland is preserved.
""")
```
## Results
```bash
| | sentences |
|---:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 | He was given boluses of MS04 with some effect, he has since been placed on a PCA - he take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, he has also received ativan for anxiety. |
| 1 | Repleted with 20 meq kcl po, 30 mmol K-phos iv and 2 gms mag so4 iv. |
| 2 | Size: Prostate gland measures 10x1.1x 4.9 cm (LS x AP x TS). |
|  3 | Estimated volume is 51.9 ml. , and is mildly enlarged in size. |
| 4 | Normal delineation pattern of the prostate gland is preserved. |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sentence_detector_dl_healthcare|
|Compatibility:|Healthcare NLP 3.2.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[sentences]|
|Language:|en|
## Data Source
Healthcare SDDL model is trained on in-house domain specific data.
## Benchmarking
```bash
label Accuracy Recall Prec F1
0 0.98 1.00 0.96 0.98
```
---
layout: model
title: Fast Neural Machine Translation Model from Haitian Creole to English
author: John Snow Labs
name: opus_mt_ht_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, ht, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `ht`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ht_en_xx_2.7.0_2.4_1609166270703.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ht_en_xx_2.7.0_2.4_1609166270703.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_ht_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_ht_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.ht.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_ht_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Sales Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_sales_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, sales, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Sales` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them, unless you want to do Binary Classification at sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
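The first technique, paragraph splitting by multiline, can be sketched in plain Python (a simplified illustration, not the workshop's exact code):

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on blank lines (one or more empty lines)."""
    # Normalize Windows line endings, then split on runs of blank lines.
    chunks = re.split(r"\n\s*\n", text.replace("\r\n", "\n"))
    return [c.strip() for c in chunks if c.strip()]

doc = "First clause text.\n\nSecond clause text.\n\n\nThird clause text."
print(split_paragraphs(doc))
# → ['First clause text.', 'Second clause text.', 'Third clause text.']
```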
Take into consideration that the embeddings of this model allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (see the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Sales`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_sales_bert_en_1.0.0_3.0_1678049898342.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_sales_bert_en_1.0.0_3.0_1678049898342.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
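This card ships without a usage snippet; the following is a minimal sketch based on the pattern of sibling legal classifier cards. The sentence-embeddings model name (`sent_bert_base_cased`) is an illustrative assumption; use whichever sentence embeddings your Legal NLP setup pairs with this classifier.

```python
# Sketch only: assumes a running Spark session with Legal NLP installed and licensed.
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# "sent_bert_base_cased" is an assumed choice of sentence embeddings.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_sales_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```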
## Results
```bash
+-------+
|result|
+-------+
|[Sales]|
|[Other]|
|[Other]|
|[Sales]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_sales_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 0.94 0.89 0.91 54
Sales 0.84 0.91 0.88 35
accuracy - - 0.90 89
macro-avg 0.89 0.90 0.90 89
weighted-avg 0.90 0.90 0.90 89
```
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_recipe_triplet_recipes_base_easy_squadv2_epochs_3
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `recipe_triplet_recipes-roberta-base_EASY_squadv2_epochs_3` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_recipe_triplet_recipes_base_easy_squadv2_epochs_3_en_4.3.0_3.0_1674212163630.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_recipe_triplet_recipes_base_easy_squadv2_epochs_3_en_4.3.0_3.0_1674212163630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_recipe_triplet_recipes_base_easy_squadv2_epochs_3","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_recipe_triplet_recipes_base_easy_squadv2_epochs_3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_recipe_triplet_recipes_base_easy_squadv2_epochs_3|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|467.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AnonymousSub/recipe_triplet_recipes-roberta-base_EASY_squadv2_epochs_3
---
layout: model
title: English asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent TFWav2Vec2ForCTC from creynier
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent` is an English model originally trained by creynier.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent_en_4.2.0_3.0_1664042475596.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent_en_4.2.0_3.0_1664042475596.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|349.4 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Chinese Bert Embeddings (Base, Plus, Wobert model)
author: John Snow Labs
name: bert_embeddings_wobert_chinese_plus_base
date: 2022-04-11
tags: [bert, embeddings, zh, open_source]
task: Embeddings
language: zh
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `wobert_chinese_plus_base` is a Chinese model originally trained by `junnyu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_wobert_chinese_plus_base_zh_3.4.2_3.0_1649669510103.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_wobert_chinese_plus_base_zh_3.4.2_3.0_1649669510103.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_wobert_chinese_plus_base","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_wobert_chinese_plus_base","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.embed.wobert_chinese_plus_base").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_wobert_chinese_plus_base|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|467.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/junnyu/wobert_chinese_plus_base
- https://github.com/ZhuiyiTechnology/WoBERT
- https://github.com/JunnYu/WoBERT_pytorch
---
layout: model
title: Smaller BERT Embeddings (L-8_H-128_A-2)
author: John Snow Labs
name: small_bert_L8_128
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L8_128_en_2.6.0_2.4_1598344352001.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L8_128_en_2.6.0_2.4_1598344352001.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("small_bert_L8_128", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("small_bert_L8_128", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.bert.small_L8_128').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_bert_small_L8_128_embeddings
I [1.8417736291885376, 0.29461684823036194, -0.3...
love [2.903827428817749, 0.6693897247314453, -0.338...
NLP [1.8207342624664307, 0.1299048662185669, -1.94...
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|small_bert_L8_128|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|128|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from [https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1)
---
layout: model
title: Lemmatization from BSC/projecte_aina lookups
author: cayorodriguez
name: lemmatizer_bsc
date: 2022-07-07
tags: [ca, open_source]
task: Lemmatization
language: ca
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: false
recommended: true
annotator: LemmatizerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Lemmatizer using lookup tables from `BSC/projecte_aina` sources. This Lemmatizer should work with specific tokenization rules included in the Python usage section.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/community.johnsnowlabs.com/cayorodriguez/lemmatizer_bsc_ca_3.4.4_3.0_1657199421685.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://community.johnsnowlabs.com/cayorodriguez/lemmatizer_bsc_ca_3.4.4_3.0_1657199421685.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
ex_list = ["aprox\.","pàg\.","p\.ex\.","gen\.","feb\.","abr\.","jul\.","set\.","oct\.","nov\.","dec\.","dr\.","dra\.","sr\.","sra\.","srta\.","núm\.","st\.","sta\.","pl\.","etc\.", "ex\."]
ex_list_all = []
ex_list_all.extend(ex_list)
ex_list_all.extend([x[0].upper() + x[1:] for x in ex_list])
ex_list_all.extend([x.upper() for x in ex_list])
tokenizer = Tokenizer() \
.setInputCols(['document']).setOutputCol('token')\
.setInfixPatterns(["(d|D)(els)","(d|D)(el)","(a|A)(ls)","(a|A)(l)","(p|P)(els)","(p|P)(el)",\
"([A-zÀ-ú_@]+)(-[A-zÀ-ú_@]+)",\
"(d'|D')([·A-zÀ-ú@_]+)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|'|,)+","(l'|L')([·A-zÀ-ú_]+)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|'|,)+", \
"(l'|l'|s'|s'|d'|d'|m'|m'|n'|n'|D'|D'|L'|L'|S'|S'|N'|N'|M'|M')([A-zÀ-ú_]+)",\
"""([A-zÀ-ú·]+)(\.|,|\)|\?|!|;|\:|\"|”)(\.|,|\)|\?|!|;|\:|\"|”)+""",\
"([A-zÀ-ú·]+)('l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)(\.|,|;|:|\?|,)+",\
"([A-zÀ-ú·]+)('l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)",\
"(\.|\"|;|:|!|\?|\-|\(|\)|”|“|')+([0-9A-zÀ-ú_]+)",\
"([0-9A-zÀ-ú·]+)(\.|\"|;|:|!|\?|\(|\)|”|“|'|,|%)",\
"(\.|\"|;|:|!|\?|\-|\(|\)|”|“|,)+([0-9]+)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|,)+",\
"(d'|D'|l'|L')([·A-zÀ-ú@_]+)('l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|,)", \
"([\.|\"|;|:|!|\?|\-|\(|\)|”|“|,]+)([\.|\"|;|:|!|\?|\-|\(|\)|”|“|,]+)"]) \
.setExceptions(ex_list_all)
lemmatizer = LemmatizerModel.pretrained("lemmatizer_bsc","ca") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer])
example = spark.createDataFrame([["Bons dies, al mati"]], ["text"])
results = pipeline.fit(example).transform(example)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemmatizer_bsc|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Community|
|Input Labels:|[form]|
|Output Labels:|[lemma]|
|Language:|ca|
|Size:|7.3 MB|
---
layout: model
title: English asr_Dansk_wav2vec21 TFWav2Vec2ForCTC from Siyam
author: John Snow Labs
name: pipeline_asr_Dansk_wav2vec21
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Dansk_wav2vec21` is an English model originally trained by Siyam.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_Dansk_wav2vec21_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_Dansk_wav2vec21_en_4.2.0_3.0_1664118614170.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_Dansk_wav2vec21_en_4.2.0_3.0_1664118614170.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_Dansk_wav2vec21', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_Dansk_wav2vec21", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_Dansk_wav2vec21|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English RoBERTa Embeddings (Large, Wikipedia and Bookcorpus datasets)
author: John Snow Labs
name: roberta_embeddings_muppet_roberta_large
date: 2022-04-14
tags: [roberta, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `muppet-roberta-large` is an English model originally trained by `facebook`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_muppet_roberta_large_en_3.4.2_3.0_1649946679876.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_muppet_roberta_large_en_3.4.2_3.0_1649946679876.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_muppet_roberta_large","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_muppet_roberta_large","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.muppet_roberta_large").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_muppet_roberta_large|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|849.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/facebook/muppet-roberta-large
- https://arxiv.org/abs/2101.11038
---
layout: model
title: Portuguese Named Entity Recognition (from m-lin20)
author: John Snow Labs
name: bert_ner_satellite_instrument_bert_NER
date: 2022-05-09
tags: [bert, ner, token_classification, pt, open_source]
task: Named Entity Recognition
language: pt
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `satellite-instrument-bert-NER` is a Portuguese model originally trained by `m-lin20`.
## Predicted Entities
`satellite`, `instrument`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_satellite_instrument_bert_NER_pt_3.4.2_3.0_1652098534939.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_satellite_instrument_bert_NER_pt_3.4.2_3.0_1652098534939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_satellite_instrument_bert_NER","pt") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_satellite_instrument_bert_NER","pt")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Eu amo Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_satellite_instrument_bert_NER|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|pt|
|Size:|1.2 GB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/m-lin20/satellite-instrument-bert-NER
- https://github.com/Tsinghua-mLin/satellite-instrument-NER
---
layout: model
title: German T5ForConditionalGeneration Small Cased model (from aiassociates)
author: John Snow Labs
name: t5_small_grammar_correction
date: 2023-01-31
tags: [de, open_source, t5, tensorflow]
task: Text Generation
language: de
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-grammar-correction-german` is a German model originally trained by `aiassociates`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_grammar_correction_de_4.3.0_3.0_1675126287089.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_grammar_correction_de_4.3.0_3.0_1675126287089.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_small_grammar_correction","de") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_small_grammar_correction","de")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_small_grammar_correction|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|de|
|Size:|288.7 MB|
## References
- https://huggingface.co/aiassociates/t5-small-grammar-correction-german
- https://github.com/EricFillion/happy-transformer
- https://www.ai.associates/
- https://www.linkedin.com/company/ai-associates
---
layout: model
title: Sentence Embeddings - sbert tiny (tuned)
author: John Snow Labs
name: sbert_jsl_tiny_umls_uncased
date: 2021-06-30
tags: [embeddings, clinical, licensed, en]
task: Embeddings
language: en
nav_key: models
edition: Healthcare NLP 3.1.0
spark_version: 2.4
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained to generate contextual sentence embeddings of input sentences.
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_tiny_umls_uncased_en_3.1.0_2.4_1625050224767.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_tiny_umls_uncased_en_3.1.0_2.4_1625050224767.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_tiny_umls_uncased","en","clinical/models").setInputCols(["sentence"]).setOutputCol("sbert_embeddings")
```
```scala
val sbiobert_embeddings = BertSentenceEmbeddings
.pretrained("sbert_jsl_tiny_umls_uncased","en","clinical/models")
.setInputCols(Array("sentence"))
.setOutputCol("sbert_embeddings")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed_sentence.bert.jsl_tiny_umls_uncased").predict("""Put your text here.""")
```
## Results
```bash
Gives a 768 dimensional vector representation of the sentence.
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbert_jsl_tiny_umls_uncased|
|Compatibility:|Healthcare NLP 3.1.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Case sensitive:|false|
## Data Source
Tuned on the MedNLI and UMLS datasets
## Benchmarking
```bash
MedNLI Score
Acc 0.616
STS(cos) 0.632
```
---
layout: model
title: Spanish RobertaForQuestionAnswering (from hackathon-pln-es)
author: John Snow Labs
name: roberta_qa_roberta_base_bne_squad2_hackathon_pln
date: 2022-06-21
tags: [es, open_source, question_answering, roberta]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-squad2-es` is a Spanish model originally trained by `hackathon-pln-es`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_bne_squad2_hackathon_pln_es_4.0.0_3.0_1655790288763.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_bne_squad2_hackathon_pln_es_4.0.0_3.0_1655790288763.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_bne_squad2_hackathon_pln","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_bne_squad2_hackathon_pln","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.squadv2.roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_bne_squad2_hackathon_pln|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|456.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/hackathon-pln-es/roberta-base-bne-squad2-es
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from aszidon)
author: John Snow Labs
name: distilbert_qa_custom
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbertcustom` is an English model originally trained by `aszidon`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom_en_4.3.0_3.0_1672774581586.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom_en_4.3.0_3.0_1672774581586.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_custom|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/aszidon/distilbertcustom
---
layout: model
title: English image_classifier_vit_croupier_creature_classifier ViTForImageClassification from alkzar90
author: John Snow Labs
name: image_classifier_vit_croupier_creature_classifier
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_croupier_creature_classifier` is an English model originally trained by alkzar90.
## Predicted Entities
`elf`, `goblin`, `knight`, `zombie`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_croupier_creature_classifier_en_4.1.0_3.0_1660171498624.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_croupier_creature_classifier_en_4.1.0_3.0_1660171498624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_croupier_creature_classifier", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_croupier_creature_classifier", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_croupier_creature_classifier|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Entity Resolver for Human Phenotype Ontology
author: John Snow Labs
name: sbiobertresolve_HPO
date: 2021-05-05
tags: [en, licensed, clinical, entity_resolution]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.2
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps phenotypic abnormalities encountered in human diseases to Human Phenotype Ontology (HPO) codes.
## Predicted Entities
This model returns Human Phenotype Ontology (HPO) codes for phenotypic abnormalities encountered in human diseases. It also returns associated codes from the following vocabularies for each HPO code:
- MeSH (Medical Subject Headings)
- SNOMED
- UMLS (Unified Medical Language System )
- ORPHA (international reference resource for information on rare diseases and orphan drugs)
- OMIM (Online Mendelian Inheritance in Man)
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_HPO/){:.button.button-orange}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_HPO_en_3.0.2_3.0_1620235451661.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_HPO_en_3.0.2_3.0_1620235451661.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The ```sbiobertresolve_HPO``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_human_phenotype_gene_clinical``` as the NER model. There is no need to call ```.setWhiteList()```.
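Following that recipe end to end, a pipeline sketch is shown below. This is a sketch rather than a verified snippet: it requires a licensed Spark NLP for Healthcare environment, and the scaffolding stages (sentence detector, clinical word embeddings, NER converter, Chunk2Doc) are the usual resolver recipe, assumed here rather than taken from this card.

```python
# Sketch only -- Spark NLP for Healthcare imports and session setup omitted.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

nerConverter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

# Re-package each NER chunk as a document so sentence embeddings can be computed per chunk.
chunk2doc = Chunk2Doc() \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("ner_chunk_doc")

sbertEmbeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") \
    .setInputCols(["ner_chunk_doc"]) \
    .setOutputCol("sbert_embeddings")

resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_HPO", "en", "clinical/models") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("hpo_code")

pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, wordEmbeddings,
                            ner, nerConverter, chunk2doc, sbertEmbeddings, resolver])
```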
## Results
```bash
| | chunk | entity | resolution | aux_codes |
|---:|:-----------------|:---------|:-------------|:-----------------------------------------------------------------------------|
| 0 | cancer | HP | HP:0002664 | MSH:D009369||SNOMED:108369006,363346000||UMLS:C0006826,C0027651||ORPHA:1775 |
| 1 | bipolar disorder | HP | HP:0007302 | MSH:D001714||SNOMED:13746004||UMLS:C0005586||ORPHA:370079 |
| 2 | schizophrenia | HP | HP:0100753 | MSH:D012559||SNOMED:191526005,58214004||UMLS:C0036341||ORPHA:231169 |
| 3 | autism | HP | HP:0000717 | MSH:D001321||SNOMED:408856003,408857007,43614003||UMLS:C0004352||ORPHA:79279 |
| 4 | myopia | HP | HP:0000545 | MSH:D009216||SNOMED:57190000||UMLS:C0027092||ORPHA:370022 |
```
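Each `aux_codes` cell packs several vocabularies into a single `||`-separated string. Assuming exactly the `VOCAB:code1,code2` layout shown in the table above (the helper name below is ours, not part of the model's API), it can be unpacked like this:

```python
def parse_aux_codes(aux_codes: str) -> dict:
    """Unpack an aux_codes string such as
    'MSH:D009216||SNOMED:57190000||UMLS:C0027092||ORPHA:370022'
    into {'MSH': ['D009216'], 'SNOMED': ['57190000'], ...}."""
    parsed = {}
    for entry in aux_codes.split("||"):
        vocab, _, codes = entry.partition(":")
        parsed[vocab] = codes.split(",")
    return parsed

row = "MSH:D009369||SNOMED:108369006,363346000||UMLS:C0006826,C0027651||ORPHA:1775"
codes = parse_aux_codes(row)  # codes["SNOMED"] == ["108369006", "363346000"]
```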
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_HPO|
|Compatibility:|Healthcare NLP 3.0.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[hpo_code]|
|Language:|en|
---
layout: model
title: Legal Exclusivity Clause Binary Classifier
author: John Snow Labs
name: legclf_exclusivity_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `exclusivity` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the other hundreds of legal clause classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `exclusivity`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_exclusivity_clause_en_1.0.0_3.2_1660122411802.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_exclusivity_clause_en_1.0.0_3.2_1660122411802.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
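This card ships no code snippet, so here is a minimal sketch of how a `legclf_*` model is typically wired. It is not runnable outside a licensed Legal NLP environment, and the `sent_bert_base_cased` embeddings name and the `ClassifierDLModel` class are assumptions inferred from the card's `sentence_embeddings` input label.

```python
# Sketch only -- Spark NLP / Legal NLP imports and session setup omitted.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceEmbeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_exclusivity_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, sentenceEmbeddings, docClassifier])
result = pipeline.fit(data).transform(data)
```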
## Results
```bash
+-------------+
|       result|
+-------------+
|[exclusivity]|
|      [other]|
|      [other]|
|[exclusivity]|
+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_exclusivity_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
exclusivity 0.79 0.61 0.69 36
other 0.86 0.93 0.90 92
accuracy - - 0.84 128
macro-avg 0.82 0.77 0.79 128
weighted-avg 0.84 0.84 0.84 128
```
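The macro and weighted averages in the table above follow directly from the per-class F1 scores and supports; a quick pure-Python check:

```python
# Per-class F1 and support, copied from the benchmarking table above.
scores = {
    "exclusivity": {"f1": 0.69, "support": 36},
    "other":       {"f1": 0.90, "support": 92},
}

# Macro average: unweighted mean over classes.
macro_f1 = sum(s["f1"] for s in scores.values()) / len(scores)

# Weighted average: mean weighted by class support.
total = sum(s["support"] for s in scores.values())
weighted_f1 = sum(s["f1"] * s["support"] for s in scores.values()) / total

# macro_f1 is about 0.795 and weighted_f1 about 0.841; the table's 0.79 / 0.84
# reflect rounding of the underlying per-class scores.
```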
---
layout: model
title: English BertForQuestionAnswering model (from batterydata)
author: John Snow Labs
name: bert_qa_batterydata_bert_base_uncased_squad_v1
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad-v1` is an English model originally trained by `batterydata`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_batterydata_bert_base_uncased_squad_v1_en_4.0.0_3.0_1654181357717.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_batterydata_bert_base_uncased_squad_v1_en_4.0.0_3.0_1654181357717.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_batterydata_bert_base_uncased_squad_v1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_batterydata_bert_base_uncased_squad_v1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad_battery.bert.base_uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_batterydata_bert_base_uncased_squad_v1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/batterydata/bert-base-uncased-squad-v1
- https://github.com/ShuHuang/batterybert
---
layout: model
title: English asr_wav2vec2_xls_r_300m_kh TFWav2Vec2ForCTC from kongkeaouch
author: John Snow Labs
name: asr_wav2vec2_xls_r_300m_kh
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_kh` is an English model originally trained by kongkeaouch.
NOTE: This model only works on a CPU. If you need to run this model on a GPU device, please use asr_wav2vec2_xls_r_300m_kh_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_kh_en_4.2.0_3.0_1664025079738.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_kh_en_4.2.0_3.0_1664025079738.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_xls_r_300m_kh", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_xls_r_300m_kh", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_xls_r_300m_kh|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Legal Representations Clause Binary Classifier
author: John Snow Labs
name: legclf_representations_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `representations` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences instead of the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
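The paragraph-splitting option listed above can be illustrated with a minimal plain-Python sketch (an illustration only; the workshop tutorial linked above uses Spark NLP annotators for this):

```python
def split_paragraphs(text: str) -> list[str]:
    """Split a document into paragraphs on blank lines ("multiline" splitting),
    so each piece stays within the classifier's 512-token window."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# Hypothetical contract fragment used only for illustration
contract = (
    "REPRESENTATIONS AND WARRANTIES\n\n"
    "Each party represents and warrants that it has full power to enter into this Agreement.\n\n"
    "GOVERNING LAW\n\n"
    "This Agreement is governed by the laws of the State of New York."
)
paragraphs = split_paragraphs(contract)  # each piece can then be classified independently
```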
## Predicted Entities
`other`, `representations`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_representations_clause_en_1.0.0_3.2_1660122946365.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_representations_clause_en_1.0.0_3.2_1660122946365.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
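This card ships without a usage snippet. The sketch below follows the standard Legal NLP document-classification pipeline (document assembler, sentence embeddings, classifier). The embeddings model name `sent_bert_base_cased` and the `nlp`/`legal` module imports from the `johnsnowlabs` library are assumptions — check this classifier's training details for the exact embeddings it expects.

```python
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed embeddings model; must match what the classifier was trained with
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_representations_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```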
## Results
```bash
+-----------------+
|           result|
+-----------------+
|[representations]|
|          [other]|
|          [other]|
|[representations]|
+-----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_representations_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.2 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.98 0.98 0.98 645
representations 0.95 0.94 0.95 212
accuracy - - 0.97 857
macro-avg 0.97 0.96 0.96 857
weighted-avg 0.97 0.97 0.97 857
```
---
layout: model
title: Norwegian BertForTokenClassification Base Cased model (from Kushtrim)
author: John Snow Labs
name: bert_token_classifier_base_multilingual_cased_finetuned_norsk_ner
date: 2022-11-30
tags: ["no", open_source, bert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: "no"
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-norsk-ner` is a Norwegian model originally trained by `Kushtrim`.
## Predicted Entities
`MISC`, `LOC`, `ORG`, `PER`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_multilingual_cased_finetuned_norsk_ner_no_4.2.4_3.0_1669815034218.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_multilingual_cased_finetuned_norsk_ner_no_4.2.4_3.0_1669815034218.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_multilingual_cased_finetuned_norsk_ner","no") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_multilingual_cased_finetuned_norsk_ner","no")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_base_multilingual_cased_finetuned_norsk_ner|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|no|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Kushtrim/bert-base-multilingual-cased-finetuned-norsk-ner
---
layout: model
title: Pipeline to Extract Neurologic Deficits Related to Stroke Scale (NIHSS)
author: John Snow Labs
name: ner_nihss_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, en]
task: [Named Entity Recognition, Pipeline Healthcare]
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_nihss](https://nlp.johnsnowlabs.com/2021/11/15/ner_nihss_en.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_NIHSS/){:.button.button-orange.button-orange-trans.arr.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_NIHSS.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_nihss_pipeline_en_3.4.1_3.0_1647871076449.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_nihss_pipeline_en_3.4.1_3.0_1647871076449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_nihss_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("Abdomen , soft , nontender . NIH stroke scale on presentation was 23 to 24 for , one for consciousness , two for month and year and two for eye / grip , one to two for gaze , two for face , eight for motor , one for limited ataxia , one to two for sensory , three for best language and two for attention . On the neurologic examination the patient was intermittently.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_nihss_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("Abdomen , soft , nontender . NIH stroke scale on presentation was 23 to 24 for , one for consciousness , two for month and year and two for eye / grip , one to two for gaze , two for face , eight for motor , one for limited ataxia , one to two for sensory , three for best language and two for attention . On the neurologic examination the patient was intermittently.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.nihss_pipeline").predict("""Abdomen , soft , nontender . NIH stroke scale on presentation was 23 to 24 for , one for consciousness , two for month and year and two for eye / grip , one to two for gaze , two for face , eight for motor , one for limited ataxia , one to two for sensory , three for best language and two for attention . On the neurologic examination the patient was intermittently.""")
```
## Results
```bash
| | chunk | entity |
|---:|:-------------------|:-------------------------|
| 0 | NIH stroke scale | NIHSS |
| 1 | 23 to 24 | Measurement |
| 2 | one | Measurement |
| 3 | consciousness | 1a_LOC |
| 4 | two | Measurement |
| 5 | month and year | 1b_LOCQuestions |
| 6 | two | Measurement |
| 7 | eye / grip | 1c_LOCCommands |
| 8 | one | Measurement |
| 9 | two | Measurement |
| 10 | gaze | 2_BestGaze |
| 11 | two | Measurement |
| 12 | face | 4_FacialPalsy |
| 13 | eight | Measurement |
| 14 | one | Measurement |
| 15 | limited ataxia | 7_LimbAtaxia |
| 16 | one to two | Measurement |
| 17 | sensory | 8_Sensory |
| 18 | three | Measurement |
| 19 | best language | 9_BestLanguage |
| 20 | two | Measurement |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_nihss_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: Stop Words Cleaner for Portuguese
author: John Snow Labs
name: stopwords_pt
date: 2020-07-14 19:03:00 +0800
task: Stop Words Removal
language: pt
edition: Spark NLP 2.5.4
spark_version: 2.4
tags: [stopwords, pt]
supported: true
annotator: StopWordsCleaner
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_pt_pt_2.5.4_2.4_1594742441703.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_pt_pt_2.5.4_2.4_1594742441703.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
stop_words = StopWordsCleaner.pretrained("stopwords_pt", "pt") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica.")
```
```scala
...
val stopWords = StopWordsCleaner.pretrained("stopwords_pt", "pt")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords))
val data = Seq("Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica."""]
stopword_df = nlu.load('pt.stopwords').predict(text)
stopword_df[['cleanTokens']]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=14, end=16, result='rei', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=21, end=25, result='norte', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=26, end=26, result=',', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=28, end=31, result='John', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=33, end=36, result='Snow', metadata={'sentence': '0'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_pt|
|Type:|stopwords|
|Compatibility:|Spark NLP 2.5.4+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|pt|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from aszidon)
author: John Snow Labs
name: distilbert_qa_custom2
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbertcustom2` is an English model originally trained by `aszidon`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom2_en_4.3.0_3.0_1672774614830.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom2_en_4.3.0_3.0_1672774614830.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_custom2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/aszidon/distilbertcustom2
---
layout: model
title: Spanish BertForQuestionAnswering model (from MMG)
author: John Snow Labs
name: bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_finetuned_sqac
date: 2022-06-02
tags: [es, open_source, question_answering, bert]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-cased-finetuned-spa-squad2-es-finetuned-sqac` is a Spanish model originally trained by `MMG`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_finetuned_sqac_es_4.0.0_3.0_1654180469657.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_finetuned_sqac_es_4.0.0_3.0_1654180469657.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_finetuned_sqac","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_finetuned_sqac","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.squadv2_sqac.bert.base_cased_spa.by_MMG").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_finetuned_sqac|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|410.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/MMG/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es-finetuned-sqac
---
layout: model
title: Legal Tax returns Clause Binary Classifier
author: John Snow Labs
name: legclf_tax_returns_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `tax-returns` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences instead of the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `tax-returns`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_tax_returns_clause_en_1.0.0_3.2_1660123065637.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_tax_returns_clause_en_1.0.0_3.2_1660123065637.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
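No usage snippet is included in this card. A sketch following the usual Legal NLP document-classification pipeline is shown below; the `sent_bert_base_cased` embeddings model and the `nlp`/`legal` modules from the `johnsnowlabs` library are assumptions, so verify the embeddings this classifier was trained with before relying on it.

```python
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed embeddings model; must match what the classifier was trained with
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_tax_returns_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```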
## Results
```bash
+-------------+
|       result|
+-------------+
|[tax-returns]|
|      [other]|
|      [other]|
|[tax-returns]|
+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_tax_returns_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.5 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 1.00 0.92 0.96 37
tax-returns 0.75 1.00 0.86 9
accuracy - - 0.93 46
macro-avg 0.88 0.96 0.91 46
weighted-avg 0.95 0.93 0.94 46
```
---
layout: model
title: Pipeline to Detect Units and Measurements in text
author: John Snow Labs
name: ner_measurements_clinical_pipeline
date: 2023-03-14
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_measurements_clinical](https://nlp.johnsnowlabs.com/2021/04/01/ner_measurements_clinical_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_measurements_clinical_pipeline_en_4.3.0_3.2_1678832259909.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_measurements_clinical_pipeline_en_4.3.0_3.2_1678832259909.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_measurements_clinical_pipeline", "en", "clinical/models")
text = '''Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_measurements_clinical_pipeline", "en", "clinical/models")
val text = "Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.clinical_measurements.pipeline").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""")
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:----------------|--------:|------:|:-------------|-------------:|
| 0 | 0.5 x 0.5 x 0.4 | 113 | 127 | Measurements | 0.98748 |
| 1 | cm | 129 | 130 | Units | 0.9996 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_measurements_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Fast Neural Machine Translation Model from East Slavic Languages to English
author: John Snow Labs
name: opus_mt_zle_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, zle, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `zle`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_zle_en_xx_2.7.0_2.4_1609166964113.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_zle_en_xx_2.7.0_2.4_1609166964113.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_zle_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_zle_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.zle.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_zle_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Financial News Summarization (X-Small)
author: John Snow Labs
name: finsum_news_xs
date: 2022-11-23
tags: [financial, summarization, en, licensed]
task: Summarization
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Financial News Summarizer, fine-tuned on an extra-small financial dataset (about 4K news articles).
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finsum_news_xs_en_1.0.0_3.0_1669213220483.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finsum_news_xs_en_1.0.0_3.0_1669213220483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("documents")
t5 = nlp.T5Transformer() \
.pretrained("finsum_news_xs" ,"en", "finance/models") \
.setTask("summarize:")\
.setMaxOutputLength(512)\
.setInputCols(["documents"]) \
.setOutputCol("summaries")
data_df = spark.createDataFrame([["Deere Grows Sales 37% as Shipments Rise. Farm equipment supplier forecasts higher sales in year ahead, lifted by price increases and infrastructure investments. Deere & Co. said its fiscal fourth-quarter sales surged 37% as supply constraints eased and the company shipped more of its farm and construction equipment. The Moline, Ill.-based company, the largest supplier of farm equipment in the U.S., said demand held up as it raised prices on farm equipment, and forecast sales gains in the year ahead. Chief Executive John May cited strong demand and increased investment in infrastructure projects as the Biden administration ramps up spending. Elevated crop prices have kept farmers interested in new machinery even as their own production expenses increase."]]).toDF("text")
pipeline = nlp.Pipeline().setStages([document_assembler, t5])
results = pipeline.fit(data_df).transform(data_df)
results.select("summaries.result").show(truncate=False)
```
## Results
```bash
Deere & Co. said its fiscal fourth-quarter sales surged 37% as supply constraints eased and the company shipped more farm and construction equipment. Deere & Co. said its fiscal fourth-quarter sales surged 37% as supply constraints eased and the company shipped more farm and construction equipment.
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finsum_news_xs|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[summaries]|
|Language:|en|
|Size:|923.2 MB|
## Benchmarking
```bash
John Snow Labs in-house summarized articles.
```
---
layout: model
title: English DistilBertForQuestionAnswering model (from Thitaree)
author: John Snow Labs
name: distilbert_qa_Thitaree_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Thitaree`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Thitaree_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724750492.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Thitaree_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724750492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Thitaree_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Thitaree_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Thitaree").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_Thitaree_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Thitaree/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Polish BertForQuestionAnswering model (from henryk)
author: John Snow Labs
name: bert_qa_bert_base_multilingual_cased_finetuned_polish_squad2
date: 2022-06-02
tags: [pl, open_source, question_answering, bert]
task: Question Answering
language: pl
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-polish-squad2` is a Polish model originally trained by `henryk`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetuned_polish_squad2_pl_4.0.0_3.0_1654180123880.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetuned_polish_squad2_pl_4.0.0_3.0_1654180123880.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_multilingual_cased_finetuned_polish_squad2","pl") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_multilingual_cased_finetuned_polish_squad2","pl")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("pl.answer_question.squadv2.bert.multilingual_base_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_multilingual_cased_finetuned_polish_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|pl|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/henryk/bert-base-multilingual-cased-finetuned-polish-squad2
- https://www.linkedin.com/in/henryk-borzymowski-0755a2167/
- https://rajpurkar.github.io/SQuAD-explorer/
- https://github.com/google-research/bert/blob/master/multilingual.md
---
layout: model
title: Finnish asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP TFWav2Vec2ForCTC from Finnish-NLP
author: John Snow Labs
name: pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP
date: 2022-09-24
tags: [wav2vec2, fi, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP` is a Finnish model originally trained by Finnish-NLP.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP_fi_4.2.0_3.0_1664040198920.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP_fi_4.2.0_3.0_1664040198920.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP', lang = 'fi')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP", lang = "fi")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|fi|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Translate Oromo to English Pipeline
author: John Snow Labs
name: translate_om_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, om, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `om`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_om_en_xx_2.7.0_2.4_1609690273713.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_om_en_xx_2.7.0_2.4_1609690273713.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_om_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_om_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.om.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_om_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Pipeline to Detect Anatomical Structures (Single Entity - biobert)
author: John Snow Labs
name: ner_anatomy_coarse_biobert_pipeline
date: 2023-03-20
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_anatomy_coarse_biobert](https://nlp.johnsnowlabs.com/2021/03/31/ner_anatomy_coarse_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_biobert_pipeline_en_4.3.0_3.2_1679316528376.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_biobert_pipeline_en_4.3.0_3.2_1679316528376.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_anatomy_coarse_biobert_pipeline", "en", "clinical/models")
text = '''content in the lung tissue'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_anatomy_coarse_biobert_pipeline", "en", "clinical/models")
val text = "content in the lung tissue"
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.anatomy_coarse_biobert.pipeline").predict("""content in the lung tissue""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:------------|--------:|------:|:------------|-------------:|
| 0 | lung tissue | 15 | 25 | Anatomy | 0.99155 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_anatomy_coarse_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.2 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Fast Neural Machine Translation Model from English to Malagasy
author: John Snow Labs
name: opus_mt_en_mg
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, mg, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `mg`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_mg_xx_2.7.0_2.4_1609167829744.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_mg_xx_2.7.0_2.4_1609167829744.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_mg", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_mg", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.mg').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_mg|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from avioo1)
author: John Snow Labs
name: roberta_qa_avioo1_base_squad2_finetuned_squad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-squad` is an English model originally trained by `avioo1`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_avioo1_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219191405.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_avioo1_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219191405.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_avioo1_base_squad2_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_avioo1_base_squad2_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_avioo1_base_squad2_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/avioo1/roberta-base-squad2-finetuned-squad
---
layout: model
title: Pipeline to Detect Drug Information (Small)
author: John Snow Labs
name: ner_posology_small_pipeline
date: 2023-03-15
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_posology_small](https://nlp.johnsnowlabs.com/2021/03/31/ner_posology_small_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_small_pipeline_en_4.3.0_3.2_1678868910811.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_small_pipeline_en_4.3.0_3.2_1678868910811.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_posology_small_pipeline", "en", "clinical/models")
text = '''The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_posology_small_pipeline", "en", "clinical/models")
val text = "The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.posology_small.pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_German_MedBERT","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_German_MedBERT","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ich liebe Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.embed.medbert").predict("""Ich liebe Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_German_MedBERT|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|de|
|Size:|409.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/smanjil/German-MedBERT
- https://opus4.kobv.de/opus4-rhein-waal/frontdoor/index/index/searchtype/collection/id/16225/start/0/rows/10/doctypefq/masterthesis/docId/740
- https://www.linkedin.com/in/manjil-shrestha-038527b4/
---
layout: model
title: French CamemBert Embeddings (from DoyyingFace)
author: John Snow Labs
name: camembert_embeddings_DoyyingFace_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `DoyyingFace`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_DoyyingFace_generic_model_fr_3.4.4_3.0_1653986003984.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_DoyyingFace_generic_model_fr_3.4.4_3.0_1653986003984.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_DoyyingFace_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_DoyyingFace_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_DoyyingFace_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/DoyyingFace/dummy-model
---
layout: model
title: Hindi Named Entity Recognition (from sagorsarker)
author: John Snow Labs
name: bert_ner_codeswitch_hineng_ner_lince
date: 2022-05-09
tags: [bert, ner, token_classification, hi, open_source]
task: Named Entity Recognition
language: hi
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `codeswitch-hineng-ner-lince` is a Hindi model originally trained by `sagorsarker`.
## Predicted Entities
`PERSON`, `ORGANISATION`, `PLACE`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_codeswitch_hineng_ner_lince_hi_3.4.2_3.0_1652097576639.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_codeswitch_hineng_ner_lince_hi_3.4.2_3.0_1652097576639.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_codeswitch_hineng_ner_lince","hi") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["मुझे स्पार्क एनएलपी बहुत पसंद है"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_codeswitch_hineng_ner_lince","hi")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("मुझे स्पार्क एनएलपी बहुत पसंद है").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_codeswitch_hineng_ner_lince|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|hi|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/sagorsarker/codeswitch-hineng-ner-lince
- https://ritual.uh.edu/lince/home
- https://github.com/sagorbrur/codeswitch
---
layout: model
title: English image_classifier_vit_rust_image_classification_8 ViTForImageClassification from SummerChiam
author: John Snow Labs
name: image_classifier_vit_rust_image_classification_8
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rust_image_classification_8` is an English model originally trained by SummerChiam.
## Predicted Entities
`nonrust`, `rust`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_8_en_4.1.0_3.0_1660166811431.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_8_en_4.1.0_3.0_1660166811431.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_rust_image_classification_8", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_rust_image_classification_8", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_rust_image_classification_8|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Sentence Entity Resolver for ICD10-CM (Augmented)
author: John Snow Labs
name: sbiobertresolve_icd10cm_augmented
date: 2021-10-31
tags: [icd10cm, entity_resolution, clinical, en, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.1
spark_version: 2.4
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to ICD10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It has also been augmented with synonyms to improve accuracy.
## Predicted Entities
`ICD10CM Codes`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_3.3.1_2.4_1635684621243.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_3.3.1_2.4_1635684621243.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
icd10_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models") \
.setInputCols(["ner_chunk", "sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver])
data_ner = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."]]).toDF("text")
results = nlpPipeline.fit(data_ner).transform(data_ner)
```
```scala
...
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
val icd10_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models")
.setInputCols(Array("ner_chunk", "sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver))
val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.icd10cm.augmented").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.""")
```
## Results
```bash
+-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ner_chunk| entity|icd10cm_code| resolutions| all_codes|
+-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
| gestational diabetes mellitus|PROBLEM| O2441|gestational diabetes mellitus:::postpartum gestational diabetes mel...| O2441:::O2443:::Z8632:::Z875:::O2431:::O2411:::O244:::O241:::O2481|
|subsequent type two diabetes mellitus|PROBLEM| O2411|pre-existing type 2 diabetes mellitus:::disorder associated with ty...|O2411:::E118:::E11:::E139:::E119:::E113:::E1144:::Z863:::Z8639:::E1...|
| T2DM|PROBLEM| E11|type 2 diabetes mellitus:::disorder associated with type 2 diabetes...|E11:::E118:::E119:::O2411:::E109:::E139:::E113:::E8881:::Z833:::D64...|
| HTG-induced pancreatitis|PROBLEM| K8520|alcohol-induced pancreatitis:::drug-induced acute pancreatitis:::he...|K8520:::K853:::K8590:::F102:::K852:::K859:::K8580:::K8591:::K858:::...|
| acute hepatitis|PROBLEM| K720|acute hepatitis:::acute hepatitis a:::acute infectious hepatitis:::...|K720:::B15:::B179:::B172:::Z0389:::B159:::B150:::B16:::K752:::K712:...|
| obesity|PROBLEM| E669|obesity:::abdominal obesity:::obese:::central obesity:::overweight ...|E669:::E668:::Z6841:::Q130:::E66:::E6601:::Z8639:::E349:::H3550:::Z...|
| a body mass index|PROBLEM| Z6841|finding of body mass index:::observation of body mass index:::mass ...|Z6841:::E669:::R229:::Z681:::R223:::R221:::Z68:::R222:::R220:::R418...|
| polyuria|PROBLEM| R35|polyuria:::nocturnal polyuria:::polyuric state:::polyuric state (di...|R35:::R3581:::R358:::E232:::R31:::R350:::R8299:::N401:::E723:::O048...|
| polydipsia|PROBLEM| R631|polydipsia:::psychogenic polydipsia:::primary polydipsia:::psychoge...|R631:::F6389:::E232:::F639:::O40:::G475:::M7989:::R632:::R061:::H53...|
| poor appetite|PROBLEM| R630|poor appetite:::poor feeding:::bad taste in mouth:::unpleasant tast...|R630:::P929:::R438:::R432:::E86:::R196:::F520:::Z724:::R0689:::Z768...|
| vomiting|PROBLEM| R111|vomiting:::intermittent vomiting:::vomiting symptoms:::periodic vom...| R111:::R11:::R1110:::G43A1:::P921:::P9209:::G43A:::R1113:::R110|
| a respiratory tract infection|PROBLEM| J988|respiratory tract infection:::upper respiratory tract infection:::b...|J988:::J069:::A499:::J22:::J209:::Z593:::T17:::J0410:::Z1383:::J189...|
+-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
```
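In the output above, `resolutions` and `all_codes` are `:::`-delimited strings holding the ranked candidate descriptions and their ICD10-CM codes, with the top-ranked code surfaced as `icd10cm_code`. A minimal sketch of pairing them up in plain Python (the example values are shortened, hypothetical entries that only follow the format shown above):

```python
# Pair up ':::'-delimited resolver candidates with their codes.
# Example strings mimic the 'resolutions' / 'all_codes' column format above;
# the entries are shortened for illustration.
resolutions = "gestational diabetes mellitus:::postpartum gestational diabetes mellitus"
all_codes = "O2441:::O2443"

# zip the parallel lists into (code, description) pairs, best match first
candidates = list(zip(all_codes.split(":::"), resolutions.split(":::")))

# The first pair is the top-ranked resolution, i.e. the icd10cm_code column.
top_code, top_description = candidates[0]
print(top_code, "-", top_description)
```

The same split-and-zip can be applied per row of the result DataFrame to turn the resolver output into structured candidate lists.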
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_icd10cm_augmented|
|Compatibility:|Healthcare NLP 3.3.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[icd10cm_code]|
|Language:|en|
|Case sensitive:|false|
## Data Source
Trained on ICD10CM 2022 Codes dataset: https://www.cdc.gov/nchs/icd/icd10cm.htm
---
layout: model
title: Italian T5ForConditionalGeneration Small Cased model (from it5)
author: John Snow Labs
name: t5_it5_efficient_small_el32_headline_generation
date: 2023-01-30
tags: [it, open_source, t5, tensorflow]
task: Text Generation
language: it
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-efficient-small-el32-headline-generation` is an Italian model originally trained by `it5`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_headline_generation_it_4.3.0_3.0_1675103295731.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_headline_generation_it_4.3.0_3.0_1675103295731.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_headline_generation","it") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_headline_generation","it")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_it5_efficient_small_el32_headline_generation|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|it|
|Size:|594.0 MB|
## References
- https://huggingface.co/it5/it5-efficient-small-el32-headline-generation
- https://github.com/stefan-it
- https://arxiv.org/abs/2203.03759
- https://gsarti.com
- https://malvinanissim.github.io
- https://arxiv.org/abs/2109.10686
- https://github.com/gsarti/it5
- https://paperswithcode.com/sota?task=Headline+generation&dataset=HeadGen-IT
---
layout: model
title: Recognize Entities DL pipeline for Italian - Large
author: John Snow Labs
name: entity_recognizer_lg
date: 2021-03-23
tags: [open_source, italian, entity_recognizer_lg, pipeline, it]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: it
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The entity_recognizer_lg is a pretrained pipeline that performs basic text processing steps and recognizes entities.
It covers most of the common text processing tasks on your DataFrame.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_it_3.0.0_3.0_1616465464186.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_it_3.0.0_3.0_1616465464186.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('entity_recognizer_lg', lang = 'it')
annotations = pipeline.fullAnnotate("Ciao da John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("entity_recognizer_lg", lang = "it")
val result = pipeline.fullAnnotate("Ciao da John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Ciao da John Snow Labs! "]
result_df = nlu.load('it.ner.lg').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | embeddings | ner | entities |
|---:|:-----------------------------|:----------------------------|:----------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
| 0 | ['Ciao da John Snow Labs! '] | ['Ciao da John Snow Labs!'] | ['Ciao', 'da', 'John', 'Snow', 'Labs!'] | [[-0.238279998302459,.,...]] | ['O', 'O', 'I-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```
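The `entities` column above is derived from the token-level `ner` tags by grouping consecutive non-`O` tags into chunks (inside the pipeline this is done by the NER converter stage). A minimal, framework-free sketch of that grouping, using the tokens and tags from the result row above:

```python
def iob_to_chunks(tokens, tags):
    """Group consecutive non-'O' IOB tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            if current:  # close the open chunk
                chunks.append((" ".join(current), label))
                current, label = [], None
        else:
            prefix, entity = tag.split("-", 1)
            if prefix == "B" and current:  # 'B-' starts a new chunk
                chunks.append((" ".join(current), label))
                current = []
            current.append(token)
            label = entity
    if current:  # flush a chunk that runs to the end of the sentence
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Ciao", "da", "John", "Snow", "Labs!"]
tags = ["O", "O", "I-PER", "I-PER", "I-PER"]
print(iob_to_chunks(tokens, tags))  # [('John Snow Labs!', 'PER')]
```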
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|entity_recognizer_lg|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|it|
---
layout: model
title: Translate Punjabi (Eastern) to English Pipeline
author: John Snow Labs
name: translate_pa_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, pa, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this module is very computationally expensive, especially on longer sequences. Using an accelerator such as a GPU is recommended.
- source languages: `pa`
- target languages: `en`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/INDIAN_TRANSLATION_PUNJABI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TRANSLATION_PIPELINES_MODELS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_pa_en_xx_2.7.0_2.4_1609690246774.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_pa_en_xx_2.7.0_2.4_1609690246774.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_pa_en", lang = "xx")
result = pipeline.annotate("ਮੈਨੂੰ ਪੜ੍ਹਨਾ ਪਸੰਦ ਹੈ.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_pa_en", lang = "xx")
val result = pipeline.annotate("ਮੈਨੂੰ ਪੜ੍ਹਨਾ ਪਸੰਦ ਹੈ.")
```
{:.nlu-block}
```python
import nlu
text = ["ਮੈਨੂੰ ਪੜ੍ਹਨਾ ਪਸੰਦ ਹੈ."]
translate_df = nlu.load('xx.pa.translate_to.en').predict(text, output_level='sentence')
translate_df
```
## Results
```bash
+------------------------------+---------------------------+
|sentence |translation |
+------------------------------+---------------------------+
|ਮੈਨੂੰ ਪੜ੍ਹਨਾ ਪਸੰਦ ਹੈ. |I like reading. |
+------------------------------+---------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_pa_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from usami)
author: John Snow Labs
name: distilbert_qa_usami_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `usami`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_usami_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772969630.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_usami_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772969630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_usami_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_usami_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_usami_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/usami/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from anu24)
author: John Snow Labs
name: distilbert_qa_anu24_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `anu24`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_anu24_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769853124.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_anu24_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769853124.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anu24_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anu24_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_anu24_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anu24/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Extract Anatomical Entities from Oncology Texts
author: John Snow Labs
name: ner_oncology_anatomy_general_wip
date: 2022-09-30
tags: [licensed, clinical, oncology, en, ner, anatomy]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts anatomical entities using a general (unspecific) label rather than granular anatomical subtypes.
Definitions of Predicted Entities:
- `Anatomical_Site`: Relevant anatomical terms mentioned in text.
- `Direction`: Directional and laterality terms, such as "left", "right", "bilateral", "upper" and "lower".
## Predicted Entities
`Anatomical_Site`, `Direction`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_general_wip_en_4.0.0_3.0_1664562237279.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_general_wip_en_4.0.0_3.0_1664562237279.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_anatomy_general_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_anatomy_general_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_anatomy_general").predict("""The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.""")
```
## Results
```bash
| chunk | ner_label |
|:--------|:----------------|
| left | Direction |
| breast | Anatomical_Site |
| lungs | Anatomical_Site |
| liver | Anatomical_Site |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_anatomy_general_wip|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|843.0 KB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Anatomical_Site 2377.0 649.0 353.0 2730.0 0.79 0.87 0.83
Direction 668.0 219.0 66.0 734.0 0.75 0.91 0.82
macro_avg 3045.0 868.0 419.0 3464.0 0.77 0.89 0.83
micro_avg NaN NaN NaN NaN 0.78 0.88 0.83
```
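The benchmark columns follow the usual definitions: precision = tp / (tp + fp), recall = tp / (tp + fn), and F1 is the harmonic mean of the two. A quick sanity check against the `Anatomical_Site` row above:

```python
# Values from the Anatomical_Site row of the benchmark table above.
tp, fp, fn = 2377.0, 649.0, 353.0

precision = tp / (tp + fp)  # 2377 / 3026
recall = tp / (tp + fn)     # 2377 / 2730
f1 = 2 * precision * recall / (precision + recall)

# Rounded to two decimals these reproduce the reported 0.79 / 0.87 / 0.83.
print(round(precision, 2), round(recall, 2), round(f1, 2))
```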
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from Zamachi)
author: John Snow Labs
name: roberta_qa_for_question_answering
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-for-question-answering` is an English model originally trained by `Zamachi`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_for_question_answering_en_4.3.0_3.0_1674220787682.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_for_question_answering_en_4.3.0_3.0_1674220787682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_for_question_answering","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val questionAnswering = RoBertaForQuestionAnswering.pretrained("roberta_qa_for_question_answering","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, questionAnswering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_for_question_answering|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|466.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Zamachi/roberta-for-question-answering
---
layout: model
title: English image_classifier_vit_animal_classifier ViTForImageClassification from ritheshSree
author: John Snow Labs
name: image_classifier_vit_animal_classifier
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_animal_classifier` is an English model originally trained by ritheshSree.
## Predicted Entities
`cat`, `dog`, `snake`, `tiger`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_animal_classifier_en_4.1.0_3.0_1660170154919.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_animal_classifier_en_4.1.0_3.0_1660170154919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_animal_classifier", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_animal_classifier", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_animal_classifier|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Detect Anatomical Structures (Single Entity - biobert)
author: John Snow Labs
name: ner_anatomy_coarse_biobert
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
An NER model to extract all types of anatomical references in text using "biobert_pubmed_base_cased" embeddings. It is a single entity model and generalizes all anatomical references to a single entity.
## Predicted Entities
`Anatomy`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ANATOMY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_biobert_en_3.0.0_3.0_1617209714335.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_biobert_en_3.0.0_3.0_1617209714335.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_anatomy_coarse_biobert", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, clinical_ner, ner_converter])
model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["content in the lung tissue"]], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_anatomy_coarse_biobert", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner, ner_converter))
val data = Seq("""content in the lung tissue""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.anatomy.coarse_biobert").predict("""content in the lung tissue""")
```
## Results
```bash
| | ner_chunk | entity |
|---:|:------------------|:----------|
| 0 | lung tissue | Anatomy |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_anatomy_coarse_biobert|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Data Source
Trained on a custom dataset using 'biobert_pubmed_base_cased'.
## Benchmarking
```bash
| | label | tp | fp | fn | prec | rec | f1 |
|---:|--------------:|------:|------:|------:|---------:|---------:|---------:|
| 0 | B-Anatomy | 2499 | 155 | 162 | 0.941598 | 0.939121 | 0.940357 |
| 1 | I-Anatomy | 1695 | 116 | 158 | 0.935947 | 0.914733 | 0.925218 |
| 2 | Macro-average | 4194 | 271 | 320 | 0.938772 | 0.926927 | 0.932812 |
| 3 | Micro-average | 4194 | 271 | 320 | 0.939306 | 0.929109 | 0.93418 |
```
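The precision, recall, and F1 figures above follow the standard definitions computed from the raw tp/fp/fn counts. As a quick sanity check, the B-Anatomy row can be recomputed in plain Python:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw true-positive/false-positive/false-negative counts."""
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return prec, rec, f1

# B-Anatomy row from the benchmarking table
prec, rec, f1 = prf(2499, 155, 162)
print(round(prec, 6), round(rec, 6), round(f1, 6))
# 0.941598 0.939121 0.940357
```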
---
layout: model
title: MeSH to UMLS Code Mapping
author: John Snow Labs
name: mesh_umls_mapping
date: 2021-05-04
tags: [mesh, umls, en, licensed]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 3.0.2
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline maps MeSH codes to UMLS codes without using any text data. Feed it whitespace-delimited MeSH codes and it will return the corresponding UMLS codes as a list. If no mapping exists, the original code is returned unchanged.
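A minimal sketch of this lookup-with-fallback behavior in plain Python (the dictionary here is a hypothetical stand-in for the pipeline's internal mapping table; the three code pairs are taken from the Results section below):

```python
# Hypothetical stand-in for the pipeline's internal MeSH -> UMLS table.
MESH_TO_UMLS = {
    "C028491": "C0970275",
    "D019326": "C0886627",
    "C579867": "C3696376",
}

def map_codes(codes: str) -> list:
    """Map whitespace-delimited MeSH codes; unknown codes pass through unchanged."""
    return [MESH_TO_UMLS.get(code, code) for code in codes.split()]

print(map_codes("C028491 D019326 UNKNOWN1"))
# ['C0970275', 'C0886627', 'UNKNOWN1']
```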
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapping_en_3.0.2_3.0_1620134296251.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapping_en_3.0.2_3.0_1620134296251.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("mesh_umls_mapping","en","clinical/models")
pipeline.annotate("C028491 D019326 C579867")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("mesh_umls_mapping","en","clinical/models")
val result = pipeline.annotate("C028491 D019326 C579867")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.mesh.umls").predict("""C028491 D019326 C579867""")
```
## Results
```bash
{'mesh': ['C028491', 'D019326', 'C579867'],
'umls': ['C0970275', 'C0886627', 'C3696376']}
Note:
| MeSH | Details |
| ---------- | ----------------------------:|
| C028491 | 1,3-butylene glycol |
| D019326 | 17-alpha-Hydroxyprogesterone |
| C579867 | 3-Methylglutaconic Aciduria |
| UMLS | Details |
| ---------- | ---------------------------:|
| C0970275 | 1,3-butylene glycol |
| C0886627 | 17-hydroxyprogesterone |
| C3696376 | 3-methylglutaconic aciduria |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|mesh_umls_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.0.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
## Included Models
- DocumentAssembler
- TokenizerModel
- LemmatizerModel
- Finisher
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from google)
author: John Snow Labs
name: t5_efficient_base_nl2
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-nl2` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nl2_en_4.3.0_3.0_1675113785226.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nl2_en_4.3.0_3.0_1675113785226.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_base_nl2","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_base_nl2","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_base_nl2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|158.2 MB|
## References
- https://huggingface.co/google/t5-efficient-base-nl2
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English RobertaForQuestionAnswering (from huxxx657)
author: John Snow Labs
name: roberta_qa_roberta_base_finetuned_deletion_squad_10
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-deletion-squad-10` is an English model originally trained by `huxxx657`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_deletion_squad_10_en_4.0.0_3.0_1655733844546.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_deletion_squad_10_en_4.0.0_3.0_1655733844546.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_deletion_squad_10","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_finetuned_deletion_squad_10","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_deletion_10.by_huxxx657").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_finetuned_deletion_squad_10|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/huxxx657/roberta-base-finetuned-deletion-squad-10
---
layout: model
title: Social Determinants of Health
author: John Snow Labs
name: ner_sdoh_wip
date: 2023-02-11
tags: [licensed, clinical, en, social_determinants, ner, public_health, sdoh]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.2.8
spark_version: 3.0
supported: true
recommended: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts terminology related to Social Determinants of Health from various kinds of biomedical documents.
## Predicted Entities
`Other_SDoH_Keywords`, `Education`, `Population_Group`, `Quality_Of_Life`, `Housing`, `Substance_Frequency`, `Smoking`, `Eating_Disorder`, `Obesity`, `Healthcare_Institution`, `Financial_Status`, `Age`, `Chidhood_Event`, `Exercise`, `Communicable_Disease`, `Hypertension`, `Other_Disease`, `Violence_Or_Abuse`, `Spiritual_Beliefs`, `Employment`, `Social_Exclusion`, `Access_To_Care`, `Marital_Status`, `Diet`, `Social_Support`, `Disability`, `Mental_Health`, `Alcohol`, `Insurance_Status`, `Substance_Quantity`, `Hyperlipidemia`, `Family_Member`, `Legal_Issues`, `Race_Ethnicity`, `Gender`, `Geographic_Entity`, `Sexual_Orientation`, `Transportation`, `Sexual_Activity`, `Language`, `Substance_Use`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_NER/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_wip_en_4.2.8_3.0_1676135569606.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_wip_en_4.2.8_3.0_1676135569606.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_sdoh_wip", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
clinical_embeddings,
ner_model,
ner_converter
])
sample_texts = [["Smith is a 55 years old, divorced Mexican American woman with financial problems. She speaks Spanish. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and does not have access to health insurance or paid sick leave. She has a son student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reports having her catholic faith as a means of support as well. She has long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. She had DUI back in April and was due to be in court this week."]]
data = spark.createDataFrame(sample_texts).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_sdoh_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
clinical_embeddings,
ner_model,
ner_converter
))
val data = Seq("He continues to smoke one pack of cigarettes daily, as he has for the past 28 years.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_anatomy_coarse_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("content in the lung tissue")
```
```scala
val pipeline = new PretrainedPipeline("ner_anatomy_coarse_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("content in the lung tissue")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.anatomy_coarse_biobert.pipeline").predict("""content in the lung tissue""")
```
## Results
```bash
| | ner_chunk | entity |
|---:|:------------------|:----------|
| 0 | lung tissue | Anatomy |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_anatomy_coarse_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.0 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverter
---
layout: model
title: Chinese Word Segmentation
author: John Snow Labs
name: wordseg_large
date: 2021-01-03
task: Word Segmentation
language: zh
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, word_segmentation, zh, cn]
supported: true
annotator: WordSegmenterModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
WordSegmenterModel (WSM) is based on a maximum entropy probability model that detects word boundaries in Chinese text. Chinese text is written without white space between words, so a computer-based application cannot know _a priori_ which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.
For this model, we curated a large data set from the Chinese Treebank, Weibo, and SIGHAN 2005 data sets, and trained the neural network model described in the research paper (Xue, Nianwen. "Chinese word segmentation as character tagging." International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing. 2003.).
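As a minimal illustration of the character-tagging idea (using the common BMES scheme, B = begin, M = middle, E = end, S = single-character word; the model's internal tag set may differ), segmented words can be recovered from per-character tags like this:

```python
def decode_bmes(chars, tags):
    """Group characters into words according to BMES character tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":            # single-character word
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":          # word start
            if current:
                words.append(current)
            current = ch
        elif tag == "M":          # word continuation
            current += ch
        else:                     # "E": word end
            words.append(current + ch)
            current = ""
    if current:
        words.append(current)
    return words

# "然而" (however) + "," + "这样" (this way) as a toy example
print(decode_bmes("然而,这样", ["B", "E", "S", "B", "E"]))
# ['然而', ',', '这样']
```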
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_large_zh_2.7.0_2.4_1609681406666.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_large_zh_2.7.0_2.4_1609681406666.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh")\
.setInputCols("document")\
.setOutputCol("token")
pipeline = Pipeline(stages=[document_assembler, word_segmenter])
ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
example = spark.createDataFrame([['然而,这样的处理也衍生了一些问题。']], ["text"])
result = ws_model.transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh")
.setInputCols("document")
.setOutputCol("token")
val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter))
val data = Seq("然而,这样的处理也衍生了一些问题。").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""然而,这样的处理也衍生了一些问题。"""]
token_df = nlu.load('zh.segment_words.large').predict(text, output_level='token')
token_df
```
## Results
```bash
+----------------------------------+--------------------------------------------------------+
|text |result |
+----------------------------------+--------------------------------------------------------+
|然而,这样的处理也衍生了一些问题。|[然而, ,, 这样, 的, 处理, 也, 衍生, 了, 一些, 问题, 。]|
+----------------------------------+--------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|wordseg_large|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[token]|
|Language:|zh|
## Data Source
cn_wordseg_large_train.chartag
## Benchmarking
```bash
| Model | precision | recall | f1-score |
|---------------|--------------|--------------|--------------|
| WORDSEG_CTB   | 0.6453       | 0.6341       | 0.6397       |
| WORDSEG_WEIBO | 0.5454       | 0.5655       | 0.5553       |
| WORDSEG_MSRA  | 0.5984       | 0.6088       | 0.6035       |
| WORDSEG_PKU   | 0.6094       | 0.6321       | 0.6206       |
| WORDSEG_LARGE | 0.6326       | 0.6269       | 0.6297       |
```
---
layout: model
title: Mapping ICD10CM Codes with Their Corresponding UMLS Codes
author: John Snow Labs
name: icd10cm_umls_mapper
date: 2022-06-26
tags: [icd10cm, umls, chunk_mapper, clinical, licensed, en]
task: Chunk Mapping
language: en
nav_key: models
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained model maps ICD10CM codes to their corresponding codes in the Unified Medical Language System (UMLS).
## Predicted Entities
`umls_code`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_umls_mapper_en_3.5.3_3.0_1656278690210.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_umls_mapper_en_3.5.3_3.0_1656278690210.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("sbert_embeddings")
icd10cm_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_icd10cm","en", "clinical/models") \
.setInputCols(["ner_chunk", "sbert_embeddings"]) \
.setOutputCol("icd10cm_code")\
.setDistanceFunction("EUCLIDEAN")
chunkerMapper = ChunkMapperModel\
.pretrained("icd10cm_umls_mapper", "en", "clinical/models")\
.setInputCols(["icd10cm_code"])\
.setOutputCol("umls_mappings")\
.setRels(["umls_code"])
pipeline = Pipeline(stages = [
documentAssembler,
sbert_embedder,
icd10cm_resolver,
chunkerMapper
])
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_pipeline= LightPipeline(model)
result = light_pipeline.fullAnnotate("Neonatal skin infection")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("sbert_embeddings")
val icd10cm_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_icd10cm", "en", "clinical/models")
.setInputCols(Array("ner_chunk", "sbert_embeddings"))
.setOutputCol("icd10cm_code")
.setDistanceFunction("EUCLIDEAN")
val chunkerMapper = ChunkMapperModel
.pretrained("icd10cm_umls_mapper", "en", "clinical/models")
.setInputCols(Array("icd10cm_code"))
.setOutputCol("umls_mappings")
.setRels(Array("umls_code"))
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sbert_embedder,
icd10cm_resolver,
chunkerMapper
))
val data = Seq("Neonatal skin infection").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.icd10cm_to_umls").predict("""Neonatal skin infection""")
```
## Results
```bash
| | ner_chunk | icd10cm_code | umls_mappings |
|---:|:------------------------|:---------------|:----------------|
| 0 | Neonatal skin infection | P394 | C0456111 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|icd10cm_umls_mapper|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[icd10cm_code]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|942.9 KB|
---
layout: model
title: English Named Entity Recognition (from surrey-nlp)
author: John Snow Labs
name: roberta_ner_roberta_large_finetuned_abbr
date: 2022-05-03
tags: [roberta, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-large-finetuned-abbr` is an English model originally trained by `surrey-nlp`.
## Predicted Entities
`LF`, `AC`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_large_finetuned_abbr_en_3.4.2_3.0_1651594192589.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_large_finetuned_abbr_en_3.4.2_3.0_1651594192589.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_large_finetuned_abbr","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_large_finetuned_abbr","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.ner.roberta_large_finetuned_abbr").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_ner_roberta_large_finetuned_abbr|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/surrey-nlp/roberta-large-finetuned-abbr
- https://paperswithcode.com/sota?task=Token+Classification&dataset=surrey-nlp%2FPLOD-unfiltered
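The `ner` column produced by the token classifier above holds per-token BIO labels. As a pure-Python sketch (tokens and labels below are illustrative, not real model output), merging those labels into entity chunks works roughly like this:

```python
# Pure-Python sketch of BIO label merging, similar in spirit to what a
# chunk converter does with per-token NER labels. Illustrative only.
def bio_to_chunks(tokens, labels):
    chunks, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                chunks.append(current)
            current = [lab[2:], [tok]]  # start a new chunk
        elif lab.startswith("I-") and current and current[0] == lab[2:]:
            current[1].append(tok)      # continue the open chunk
        else:
            if current:
                chunks.append(current)
            current = None              # "O" or label mismatch closes it
    if current:
        chunks.append(current)
    return [(ent, " ".join(toks)) for ent, toks in chunks]

print(bio_to_chunks(
    ["I", "love", "Spark", "NLP"],
    ["O", "O", "B-ORG", "I-ORG"],
))  # [('ORG', 'Spark NLP')]
```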
---
layout: model
title: English Bert Embeddings (from alexanderfalk)
author: John Snow Labs
name: bert_embeddings_danbert_small_cased
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `danbert-small-cased` is an English model originally trained by `alexanderfalk`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_danbert_small_cased_en_3.4.2_3.0_1649672086620.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_danbert_small_cased_en_3.4.2_3.0_1649672086620.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_danbert_small_cased","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_danbert_small_cased","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.danbert_small_cased").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_danbert_small_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|313.7 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/alexanderfalk/danbert-small-cased
---
layout: model
title: Smaller BERT Embeddings (L-4_H-128_A-2)
author: John Snow Labs
name: small_bert_L4_128
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L4_128_en_2.6.0_2.4_1598344330158.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L4_128_en_2.6.0_2.4_1598344330158.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("small_bert_L4_128", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("small_bert_L4_128", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.bert.small_L4_128').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_bert_small_L4_128_embeddings
I [0.5109787583351135, 1.6565966606140137, 2.695....
love [1.0555483102798462, 1.8791943788528442, 1.285...
NLP [-0.23064681887626648, 0.939659833908081, 1.77...
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|small_bert_L4_128|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|128|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from [https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1)
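The 128-dimensional vectors shown in the Results above are typically compared with cosine similarity. A minimal pure-Python sketch (toy 2-dimensional vectors, not real model output):

```python
import math

# Cosine similarity between two embedding vectors of equal length.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```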
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from google)
author: John Snow Labs
name: t5_efficient_base_kv128
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-kv128` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_kv128_en_4.3.0_3.0_1675112746492.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_kv128_en_4.3.0_3.0_1675112746492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_base_kv128","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_base_kv128","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_base_kv128|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|637.5 MB|
## References
- https://huggingface.co/google/t5-efficient-base-kv128
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Firat)
author: John Snow Labs
name: distilbert_qa_firat_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Firat`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_firat_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768588053.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_firat_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768588053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_firat_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_firat_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_firat_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Firat/distilbert-base-uncased-finetuned-squad
---
layout: model
title: German Bert Embeddings (Base, Cased)
author: John Snow Labs
name: bert_embeddings_gbert_base
date: 2022-04-11
tags: [bert, embeddings, de, open_source]
task: Embeddings
language: de
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `gbert-base` is a German model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_gbert_base_de_3.4.2_3.0_1649675902802.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_gbert_base_de_3.4.2_3.0_1649675902802.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_gbert_base","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_gbert_base","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ich liebe Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.embed.gbert_base").predict("""Ich liebe Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_gbert_base|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|de|
|Size:|412.6 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/deepset/gbert-base
- https://arxiv.org/pdf/2010.10906.pdf
- https://deepset.ai/german-bert
- https://deepset.ai/germanquad
- https://github.com/deepset-ai/FARM
- https://github.com/deepset-ai/haystack/
- https://twitter.com/deepset_ai
- https://www.linkedin.com/company/deepset-ai/
- https://haystack.deepset.ai/community/join
- https://github.com/deepset-ai/haystack/discussions
- https://deepset.ai
- http://www.deepset.ai/jobs
---
layout: model
title: XLNet Embeddings (Base)
author: John Snow Labs
name: xlnet_base_cased
date: 2020-04-28
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [embeddings, en, open_source]
supported: true
annotator: XlnetEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking. The details are described in the paper "[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)".
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlnet_base_cased_en_2.5.0_2.4_1588074114942.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlnet_base_cased_en_2.5.0_2.4_1588074114942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = XlnetEmbeddings.pretrained("xlnet_base_cased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = XlnetEmbeddings.pretrained("xlnet_base_cased", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.xlnet_base_cased').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_xlnet_base_cased_embeddings
I [0.0027268705889582634, -3.5811028480529785, 0...
love [-4.020033836364746, -2.2760159969329834, 0.88...
NLP [-0.2549888491630554, -2.2768502235412598, 1.1...
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlnet_base_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.5.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|768|
|Case sensitive:|true|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/zihangdai/xlnet](https://github.com/zihangdai/xlnet)
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from Ninh)
author: John Snow Labs
name: xlmroberta_ner_ninh_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `Ninh`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_ninh_base_finetuned_panx_de_4.1.0_3.0_1660430037495.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_ninh_base_finetuned_panx_de_4.1.0_3.0_1660430037495.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_ninh_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_ninh_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_ninh_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Ninh/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: Pipeline to Detect PHI in Text
author: John Snow Labs
name: ner_deid_sd_large_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, deidentification, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_deid_sd_large](https://nlp.johnsnowlabs.com/2021/04/01/ner_deid_sd_large_en.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_sd_large_pipeline_en_3.4.1_3.0_1647870104226.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_sd_large_pipeline_en_3.4.1_3.0_1647870104226.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_deid_sd_large_pipeline", "en", "clinical/models")
pipeline.annotate("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""")
```
```scala
val pipeline = new PretrainedPipeline("ner_deid_sd_large_pipeline", "en", "clinical/models")
pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.deid.med_ner_large.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""")
```
## Results
```bash
+-----------------------------+--------+
|chunks |entities|
+-----------------------------+--------+
|2093-01-13 |DATE |
|David Hale |NAME |
|Hendrickson, Ora |NAME |
|7194334 |ID |
|01/13/93 |DATE |
|Oliveira |NAME |
|1-11-2000 |DATE |
|Cocke County Baptist Hospital|LOCATION|
|0295 Keats Street |LOCATION|
|786-5227 |CONTACT |
|Brothers Coal-Mine |LOCATION|
+-----------------------------+--------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_sd_large_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
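De-identification workflows usually follow the NER step above with a masking pass that replaces each detected chunk with its entity label. A pure-Python sketch of that step, using chunks from the Results table (the `mask_phi` helper is illustrative, not part of the pipeline):

```python
# Replace each detected PHI chunk with its entity label.
# Chunks mirror two rows of the Results table above.
def mask_phi(text, chunks):
    for chunk, entity in chunks:
        text = text.replace(chunk, f"<{entity}>")
    return text

masked = mask_phi(
    "Record date : 2093-01-13, David Hale, M.D.",
    [("2093-01-13", "DATE"), ("David Hale", "NAME")],
)
print(masked)  # Record date : <DATE>, <NAME>, M.D.
```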
---
layout: model
title: Bangla BertForMaskedLM Cased model (from neuralspace-reverie)
author: John Snow Labs
name: bert_embeddings_indic_transformers
date: 2022-12-02
tags: [bn, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: bn
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-bn-bert` is a Bangla model originally trained by `neuralspace-reverie`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_bn_4.2.4_3.0_1670022315370.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_bn_4.2.4_3.0_1670022315370.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","bn") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","bn")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_indic_transformers|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|bn|
|Size:|505.7 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/neuralspace-reverie/indic-transformers-bn-bert
- https://oscar-corpus.com/
---
layout: model
title: Relation Extraction Between Dates and Clinical Entities (ReDL)
author: John Snow Labs
name: redl_date_clinical_biobert
date: 2023-01-14
tags: [licensed, en, clinical, relation_extraction, tensorflow]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Identify whether tests were conducted, or a diagnosis was made, on a particular date by classifying relations between clinical entities and dates. `1` indicates the date and the clinical entity are related; `0` indicates they are not.
## Predicted Entities
`1`, `0`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL_DATE/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_date_clinical_biobert_en_4.2.4_3.0_1673731277460.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_date_clinical_biobert_en_4.2.4_3.0_1673731277460.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")
pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
events_ner_tagger = MedicalNerModel.pretrained("ner_events_clinical", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_chunker = NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ner_tags"])\
.setOutputCol("ner_chunks")
dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \
.setInputCols(["sentences", "pos_tags", "tokens"]) \
.setOutputCol("dependencies")
events_re_ner_chunk_filter = RENerChunksFilter() \
.setInputCols(["ner_chunks", "dependencies"])\
.setOutputCol("re_ner_chunks")
events_re_Model = RelationExtractionDLModel.pretrained("redl_date_clinical_biobert", "en", "clinical/models")\
.setPredictionThreshold(0.5)\
.setInputCols(["re_ner_chunks", "sentences"]) \
.setOutputCol("relations")
pipeline = Pipeline(stages=[
documenter,
sentencer,
tokenizer,
words_embedder,
pos_tagger,
events_ner_tagger,
ner_chunker,
dependency_parser,
events_re_ner_chunk_filter,
events_re_Model])
data = spark.createDataFrame([['''This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.''']]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val events_ner_tagger = MedicalNerModel.pretrained("ner_events_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_chunker = new NerConverterInternal()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
val events_re_ner_chunk_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunks", "dependencies"))
.setOutputCol("re_ner_chunks")
val events_re_Model = RelationExtractionDLModel.pretrained("redl_date_clinical_biobert", "en", "clinical/models")
.setPredictionThreshold(0.5)
.setInputCols(Array("re_ner_chunks", "sentences"))
.setOutputCol("relations")
val pipeline = new Pipeline().setStages(Array(documenter,sentencer,tokenizer,words_embedder,pos_tagger,events_ner_tagger,ner_chunker,dependency_parser,events_re_ner_chunk_filter,events_re_Model))
val data = Seq("This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.date").predict("""This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.""")
```
## Results
```bash
+--------+-------+-------------+-----------+--------------------+-------+-------------+-----------+--------------------+----------+
|relation|entity1|entity1_begin|entity1_end| chunk1|entity2|entity2_begin|entity2_end| chunk2|confidence|
+--------+-------+-------------+-----------+--------------------+-------+-------------+-----------+--------------------+----------+
| 1| TEST| 24| 25| CT| DATE| 30| 36| 1/12/95|0.99997973|
| 1| TEST| 24| 25| CT|PROBLEM| 44| 83|progressive memor...| 0.9998983|
| 1| TEST| 24| 25| CT| DATE| 91| 97| 8/11/94| 0.9997316|
| 1| DATE| 30| 36| 1/12/95|PROBLEM| 44| 83|progressive memor...| 0.9998915|
| 1| DATE| 30| 36| 1/12/95| DATE| 91| 97| 8/11/94| 0.9997931|
| 1|PROBLEM| 44| 83|progressive memor...| DATE| 91| 97| 8/11/94| 0.9998667|
+--------+-------+-------------+-----------+--------------------+-------+-------------+-----------+--------------------+----------+
```
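The `setPredictionThreshold(0.5)` call in the pipeline above keeps only relation candidates whose confidence reaches the threshold. A minimal pure-Python sketch of that filtering step (illustrative only, not the annotator's internals; the rows mirror the result table, and the low-confidence row is invented):

```python
# Illustrative only: how a prediction threshold prunes relation candidates.
# The first three rows mirror the result table above; the last is invented.
candidates = [
    {"relation": "1", "chunk1": "CT", "chunk2": "1/12/95", "confidence": 0.99997973},
    {"relation": "1", "chunk1": "CT", "chunk2": "8/11/94", "confidence": 0.9997316},
    {"relation": "1", "chunk1": "1/12/95", "chunk2": "8/11/94", "confidence": 0.9997931},
    {"relation": "0", "chunk1": "CT", "chunk2": "decline", "confidence": 0.31},
]

def filter_by_threshold(rows, threshold=0.5):
    """Keep only candidates at or above the confidence threshold."""
    return [r for r in rows if r["confidence"] >= threshold]

print(len(filter_by_threshold(candidates)))  # 3
```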
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|redl_date_clinical_biobert|
|Compatibility:|Healthcare NLP 4.2.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|401.7 MB|
## References
Trained on an internal dataset.
## Benchmarking
```bash
label Recall Precision F1 Support
0 0.738 0.729 0.734 84
1 0.945 0.947 0.946 416
Avg. 0.841 0.838 0.840 -
```
---
layout: model
title: Turkish BertForQuestionAnswering Base Cased model (from husnu)
author: John Snow Labs
name: bert_qa_base_turkish_128k_cased_finetuned_lr_2e_05_epochs_3
date: 2022-07-07
tags: [tr, open_source, bert, question_answering]
task: Question Answering
language: tr
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-turkish-128k-cased-finetuned_lr-2e-05_epochs-3` is a Turkish model originally trained by `husnu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_turkish_128k_cased_finetuned_lr_2e_05_epochs_3_tr_4.0.0_3.0_1657183554228.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_turkish_128k_cased_finetuned_lr_2e_05_epochs_3_tr_4.0.0_3.0_1657183554228.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_turkish_128k_cased_finetuned_lr_2e_05_epochs_3","tr") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_turkish_128k_cased_finetuned_lr_2e_05_epochs_3","tr")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_turkish_128k_cased_finetuned_lr_2e_05_epochs_3|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|tr|
|Size:|689.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/husnu/bert-base-turkish-128k-cased-finetuned_lr-2e-05_epochs-3
---
layout: model
title: Fast Neural Machine Translation Model from English to Setswana
author: John Snow Labs
name: opus_mt_en_tn
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, tn, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `tn`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_tn_xx_2.7.0_2.4_1609167140357.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_tn_xx_2.7.0_2.4_1609167140357.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_en_tn", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Put your text to translate here.")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_tn", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Put your text to translate here.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.tn').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_tn|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Chinese NER Model
author: John Snow Labs
name: bert_token_classifier_chinese_ner
date: 2021-12-07
tags: [chinese, token_classifier, bert, zh, open_source]
task: Named Entity Recognition
language: zh
edition: Spark NLP 3.3.2
spark_version: 2.4
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model was imported from `Hugging Face` and has been fine-tuned for traditional Chinese, leveraging `Bert` embeddings and `BertForTokenClassification` for NER purposes.
## Predicted Entities
`CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_chinese_ner_zh_3.3.2_2.4_1638881767667.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_chinese_ner_zh_3.3.2_2.4_1638881767667.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_chinese_ner", "zh")\
.setInputCols(["sentence",'token'])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = """我是莎拉,我从 1999 年 11 月 2 日。开始在斯图加特的梅赛德斯-奔驰公司工作。"""
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_chinese_ner", "zh")
.setInputCols(Array("sentence","token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))
val example = Seq("我是莎拉,我从 1999 年 11 月 2 日。开始在斯图加特的梅赛德斯-奔驰公司工作。").toDS.toDF("text")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.ner.bert_token").predict("""我是莎拉,我从 1999 年 11 月 2 日。开始在斯图加特的梅赛德斯-奔驰公司工作。""")
```
## Results
```bash
+-----------------+---------+
|chunk |ner_label|
+-----------------+---------+
|莎拉 |PERSON |
|1999 年 11 月 2 |DATE |
|斯图加特 |GPE |
|梅赛德斯-奔驰公司 |ORG |
+-----------------+---------+
```
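The `NerConverter` stage in the pipeline above is what turns token-level IOB tags into the entity chunks shown in the table. A minimal sketch of that grouping logic (illustrative only, not the annotator's actual implementation):

```python
# Illustrative sketch of IOB tag merging, the job NerConverter performs:
# consecutive B-X / I-X token tags are grouped into one chunk labeled X.
def merge_iob(tokens, tags, sep=" "):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # close any open chunk first
                chunks.append((sep.join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)              # continue the open chunk
        else:                                # "O" or inconsistent I- tag
            if current:
                chunks.append((sep.join(current), label))
            current, label = [], None
    if current:                              # flush a trailing chunk
        chunks.append((sep.join(current), label))
    return chunks

print(merge_iob(["莎拉", "在", "斯图加特"], ["B-PERSON", "O", "B-GPE"]))
# [('莎拉', 'PERSON'), ('斯图加特', 'GPE')]
```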
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_chinese_ner|
|Compatibility:|Spark NLP 3.3.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|zh|
|Case sensitive:|true|
|Max sentence length:|256|
## Data Source
[https://huggingface.co/ckiplab/bert-base-chinese-ner](https://huggingface.co/ckiplab/bert-base-chinese-ner)
## Benchmarking
```bash
label score
f1 0.8118
```
---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC2GM_Gene_Imbalancedscibert_scivocab_cased
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC2GM-Gene_Imbalancedscibert_scivocab_cased` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities
`GENE`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC2GM_Gene_Imbalancedscibert_scivocab_cased_en_4.0.0_3.0_1657108037612.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC2GM_Gene_Imbalancedscibert_scivocab_cased_en_4.0.0_3.0_1657108037612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC2GM_Gene_Imbalancedscibert_scivocab_cased","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC2GM_Gene_Imbalancedscibert_scivocab_cased","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_BC2GM_Gene_Imbalancedscibert_scivocab_cased|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|410.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ghadeermobasher/BC2GM-Gene_Imbalancedscibert_scivocab_cased
---
layout: model
title: English DistilBertForQuestionAnswering model (from charlieoneill)
author: John Snow Labs
name: distilbert_qa_base_uncased_gradient_clinic
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-gradient-clinic` is an English model originally trained by `charlieoneill`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_gradient_clinic_en_4.0.0_3.0_1654727035184.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_gradient_clinic_en_4.0.0_3.0_1654727035184.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_gradient_clinic","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(False)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_gradient_clinic","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(false)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.base_uncased.by_charlieoneill").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
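The nlu one-liner above packs the question and the context into a single string separated by `|||`. A sketch of splitting such an input back into its two fields (illustrative only, not nlu's internal code):

```python
# Illustrative: splitting a "question|||context" string into its two parts,
# following the separator convention used in the nlu snippet above.
def split_qa(text, sep="|||"):
    question, _, context = text.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa("What is my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What is my name?
```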
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_gradient_clinic|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/charlieoneill/distilbert-base-uncased-gradient-clinic
---
layout: model
title: Thai Word Segmentation
author: John Snow Labs
name: wordseg_best
date: 2021-01-13
task: Word Segmentation
language: th
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [th, word_segmentation, open_source]
supported: true
annotator: WordSegmenterModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
WordSegmenterModel (WSM) is based on a maximum entropy probability model to detect word boundaries in Thai text. Thai text is written without white space between words, so a computer-based application cannot know _a priori_ which sequence of characters forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.
References:
- Xue, Nianwen. "Chinese word segmentation as character tagging." International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing. 2003.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_best_th_2.7.0_2.4_1610543628078.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_best_th_2.7.0_2.4_1610543628078.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline as a substitute for the Tokenizer stage.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
word_segmenter = WordSegmenterModel.pretrained('wordseg_best', 'th')\
.setInputCols("document")\
.setOutputCol("token")
pipeline = Pipeline(stages=[document_assembler, word_segmenter])
example = spark.createDataFrame([['จวนจะถึงร้านที่คุณจองโต๊ะไว้แล้วจ้ะ']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th")
.setInputCols("document")
.setOutputCol("token")
val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter))
val data = Seq("จวนจะถึงร้านที่คุณจองโต๊ะไว้แล้วจ้ะ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Mona Lisa เป็นภาพวาดสีน้ำมันในศตวรรษที่ 16 ที่สร้างโดย Leonardo จัดขึ้นที่พิพิธภัณฑ์ลูฟร์ในปารีส"""]
token_df = nlu.load('th.segment_words').predict(text)
token_df
```
## Results
```bash
+-----------------------------------+---------------------------------------------------------+
|text |result |
+-----------------------------------+---------------------------------------------------------+
|จวนจะถึงร้านที่คุณจองโต๊ะไว้แล้วจ้ะ|[จวน, จะ, ถึง, ร้าน, ที่, คุณ, จอง, โต๊ะ, ไว้, แล้ว, จ้ะ]|
+-----------------------------------+---------------------------------------------------------+
```
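The output above can be understood through the character-tagging formulation referenced in the description: each character is tagged with its position inside a word, and words are recovered by cutting at end-of-word tags. A minimal pure-Python sketch of that decoding step (illustrative only, not the actual WordSegmenterModel internals; the B/M/E/S tag set here is the common positional scheme):

```python
# Illustrative sketch of word segmentation as character tagging: characters
# tagged B (begin), M (middle), E (end), or S (single-character word) are
# grouped into words by closing a word at every E or S tag.
def decode_tags(chars, tags):
    words, current = [], []
    for ch, tag in zip(chars, tags):
        current.append(ch)
        if tag in ("E", "S"):        # a word boundary has been reached
            words.append("".join(current))
            current = []
    if current:                       # flush any trailing partial word
        words.append("".join(current))
    return words

# "จวนจะ" segmented as จวน + จะ under the tag sequence B M E / B E
print(decode_tags(list("จวนจะ"), ["B", "M", "E", "B", "E"]))  # ['จวน', 'จะ']
```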
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|wordseg_best|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[token]|
|Language:|th|
## Data Source
The model was trained on the [BEST](http://thailang.nectec.or.th/best) corpus from the National Electronics and Computer Technology Center (NECTEC).
References:
> - Krit Kosawat, Monthika Boriboon, Patcharika Chootrakool, Ananlada Chotimongkol, Supon Klaithin, Sarawoot Kongyoung, Kanyanut Kriengket, Sitthaa Phaholphinyo, Sumonmas Purodakananda, Tipraporn Thanakulwarapas, and Chai Wutiwiwatchai, "BEST 2009: Thai word segmentation software contest," in Proc. 8th Int. Symp. Natural Language Process. (SNLP), Bangkok, Thailand, Oct.20-22, 2009, pp.83-88.
> - Monthika Boriboon, Kanyanut Kriengket, Patcharika Chootrakool, Sitthaa Phaholphinyo, Sumonmas Purodakananda, Tipraporn Thanakulwarapas, and Krit Kosawat, "BEST corpus development and analysis," in Proc. 2nd Int. Conf. Asian Language Process. (IALP), Singapore, Dec.7-9, 2009, pp.322-327.
## Benchmarking
```bash
| Model | precision | recall | f1-score |
|--------------|-----------|--------|----------|
| WORDSEG_BEST | 0.4791 | 0.6245 | 0.5422 |
```
---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_nl2
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nl2` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl2_en_4.3.0_3.0_1675123819194.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl2_en_4.3.0_3.0_1675123819194.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_tiny_nl2","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_tiny_nl2","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_nl2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|39.0 MB|
## References
- https://huggingface.co/google/t5-efficient-tiny-nl2
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English asr_wav2vec2_base_100h_with_lm_by_saahith TFWav2Vec2ForCTC from saahith
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_100h_with_lm_by_saahith
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_with_lm_by_saahith` is an English model originally trained by saahith.
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_base_100h_with_lm_by_saahith_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_with_lm_by_saahith_en_4.2.0_3.0_1664117830342.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_with_lm_by_saahith_en_4.2.0_3.0_1664117830342.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_100h_with_lm_by_saahith', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_100h_with_lm_by_saahith", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_100h_with_lm_by_saahith|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|227.9 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Detect Adverse Drug Events (healthcare)
author: John Snow Labs
name: ner_ade_healthcare
date: 2021-04-01
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Detect adverse drug events in tweets, reviews, and medical text using a pretrained NER model.
## Predicted Entities
`DRUG`, `ADE`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_ade_healthcare_en_3.0.0_3.0_1617260836627.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_ade_healthcare_en_3.0.0_3.0_1617260836627.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_ade_healthcare", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_ade_healthcare", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))
val data = Seq("EXAMPLE_TEXT").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.ade.ade_healthcare").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_ade_healthcare|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Benchmarking
```bash
+------+------+------+------+-------+---------+------+------+
|entity| tp| fp| fn| total|precision|recall| f1|
+------+------+------+------+-------+---------+------+------+
| DRUG|9649.0| 884.0|9772.0|19421.0| 0.9161|0.4968|0.6443|
| ADE|5909.0|9508.0|1987.0| 7896.0| 0.3833|0.7484|0.5069|
+------+------+------+------+-------+---------+------+------+
+------------------+
| macro|
+------------------+
|0.5755909944827655|
+------------------+
+------------------+
| micro|
+------------------+
|0.6045600310939989|
+------------------+
```
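The macro score above is the plain mean of the per-entity F1 values, and the reported micro value matches the support-weighted mean of the per-entity F1 scores. Both summary numbers can be reproduced from the tp/fp/fn counts in the table (illustrative arithmetic, not Spark NLP code):

```python
# Recomputing the summary scores from the per-entity counts in the table.
# Per-entity F1 = 2*tp / (2*tp + fp + fn); macro is the unweighted mean,
# and the reported micro value is the support-weighted mean of F1.
counts = {
    "DRUG": {"tp": 9649, "fp": 884, "fn": 9772, "total": 19421},
    "ADE":  {"tp": 5909, "fp": 9508, "fn": 1987, "total": 7896},
}

f1 = {k: 2 * c["tp"] / (2 * c["tp"] + c["fp"] + c["fn"]) for k, c in counts.items()}
macro = sum(f1.values()) / len(f1)
support = sum(c["total"] for c in counts.values())
weighted = sum(f1[k] * counts[k]["total"] for k in counts) / support

print(round(macro, 4), round(weighted, 4))  # 0.5756 0.6046
```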
---
layout: model
title: Pipeline to Resolve CVX Codes
author: John Snow Labs
name: cvx_resolver_pipeline
date: 2023-03-30
tags: [en, licensed, clinical, resolver, chunk_mapping, cvx, pipeline]
task: Pipeline Healthcare
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline maps entities to their corresponding CVX codes. Simply feed in your text, and it will return the matching CVX codes.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/cvx_resolver_pipeline_en_4.3.2_3.2_1680178011294.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/cvx_resolver_pipeline_en_4.3.2_3.2_1680178011294.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
resolver_pipeline = PretrainedPipeline("cvx_resolver_pipeline", "en", "clinical/models")
text= "The patient has a history of influenza vaccine, tetanus and DTaP"
result = resolver_pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val resolver_pipeline = new PretrainedPipeline("cvx_resolver_pipeline", "en", "clinical/models")
val result = resolver_pipeline.fullAnnotate("The patient has a history of influenza vaccine, tetanus and DTaP")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.cvx_pipeline").predict("""The patient has a history of influenza vaccine, tetanus and DTaP""")
```
## Results
```bash
+-----------------+---------+--------+
|chunk |ner_chunk|cvx_code|
+-----------------+---------+--------+
|influenza vaccine|Vaccine |160 |
|tetanus |Vaccine |35 |
|DTaP |Vaccine |20 |
+-----------------+---------+--------+
```
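Conceptually, the chunk-mapping stage behaves like a lookup from recognized vaccine mentions to CVX codes. A toy sketch using only the three mappings from the table above (the dictionary is illustrative, not the model's actual vocabulary):

```python
# Toy lookup illustrating the chunk -> CVX code mapping shown above.
# The real pipeline resolves far more entries; this dict is illustrative only.
cvx_codes = {
    "influenza vaccine": "160",
    "tetanus": "35",
    "dtap": "20",
}

def resolve_cvx(chunks):
    # case-insensitive lookup; unknown chunks get "NONE"
    return [(c, cvx_codes.get(c.lower(), "NONE")) for c in chunks]

print(resolve_cvx(["influenza vaccine", "tetanus", "DTaP"]))
```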
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|cvx_resolver_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|2.1 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
- ChunkMapperModel
- ChunkMapperFilterer
- Chunk2Doc
- BertSentenceEmbeddings
- SentenceEntityResolverModel
- ResolverMerger
---
layout: model
title: Detect Assertion Status from Oncology Entities
author: John Snow Labs
name: assertion_oncology_wip
date: 2022-10-01
tags: [licensed, clinical, oncology, en, assertion]
task: Assertion Status
language: en
nav_key: models
edition: Healthcare NLP 4.1.0
spark_version: 3.0
supported: true
annotator: AssertionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model detects the assertion status of entities related to oncology (including diagnoses, therapies and tests).
## Predicted Entities
`Absent`, `Family`, `Hypothetical`, `Past`, `Possible`, `Present`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_wip_en_4.1.0_3.0_1664641275549.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_wip_en_4.1.0_3.0_1664641275549.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(["Cancer_Dx", "Tumor_Finding", "Cancer_Surgery", "Chemotherapy", "Pathology_Test", "Imaging_Test"])
assertion = AssertionDLModel.pretrained("assertion_oncology_wip", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
assertion])
data = spark.createDataFrame([["The patient is suspected to have breast cancer. Family history is positive for other cancers. The result of the biopsy was positive."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("Cancer_Dx", "Tumor_Finding", "Cancer_Surgery", "Chemotherapy", "Pathology_Test", "Imaging_Test"))
val clinical_assertion = AssertionDLModel.pretrained("assertion_oncology_wip","en","clinical/models")
.setInputCols(Array("sentence","ner_chunk","embeddings"))
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
clinical_assertion))
val data = Seq("""The patient is suspected to have breast cancer. Family history is positive for other cancers. The result of the biopsy was positive.""").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.assert.oncology_wip").predict("""The patient is suspected to have breast cancer. Family history is positive for other cancers. The result of the biopsy was positive.""")
```
## Results
```bash
| chunk | ner_label | assertion |
|:--------------|:---------------|:------------|
| breast cancer | Cancer_Dx | Possible |
| cancers | Cancer_Dx | Family |
| biopsy | Pathology_Test | Past |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|assertion_oncology_wip|
|Compatibility:|Healthcare NLP 4.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, chunk, embeddings]|
|Output Labels:|[assertion_pred]|
|Language:|en|
|Size:|1.4 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label precision recall f1-score support
Absent 0.81 0.77 0.79 264.0
Family 0.78 0.82 0.80 34.0
Hypothetical 0.67 0.61 0.64 182.0
Past 0.91 0.93 0.92 1583.0
Possible 0.59 0.59 0.59 51.0
Present 0.89 0.89 0.89 1645.0
macro-avg 0.77 0.77 0.77 3759.0
weighted-avg 0.88 0.88 0.88 3759.0
```
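The macro-avg and weighted-avg rows follow directly from the per-label scores: macro is the plain mean of the per-label F1 values, while weighted-avg weights each label by its support. A quick check in pure Python (numbers restate the table):

```python
# Per-label F1 and support, copied from the benchmarking table above.
f1 = {"Absent": 0.79, "Family": 0.80, "Hypothetical": 0.64,
      "Past": 0.92, "Possible": 0.59, "Present": 0.89}
support = {"Absent": 264, "Family": 34, "Hypothetical": 182,
           "Past": 1583, "Possible": 51, "Present": 1645}

macro = sum(f1.values()) / len(f1)                      # unweighted mean
total = sum(support.values())
weighted = sum(f1[k] * support[k] for k in f1) / total  # support-weighted mean

print(round(macro, 2), round(weighted, 2))  # 0.77 0.88
```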
---
layout: model
title: Finnish RobertaForQuestionAnswering (from cgou)
author: John Snow Labs
name: roberta_qa_fin_RoBERTa_v1_finetuned_squad
date: 2022-06-20
tags: [open_source, question_answering, roberta]
task: Question Answering
language: fi
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fin_RoBERTa-v1-finetuned-squad` is a Finnish model originally trained by `cgou`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fin_RoBERTa_v1_finetuned_squad_fi_4.0.0_3.0_1655728569389.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fin_RoBERTa_v1_finetuned_squad_fi_4.0.0_3.0_1655728569389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_fin_RoBERTa_v1_finetuned_squad","fi") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_fin_RoBERTa_v1_finetuned_squad","fi")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("fi.answer_question.squad.roberta.by_cgou").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
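Extractive question-answering models like this one return a span of the context rather than generated text. A minimal illustration of that idea with the example sentence above (plain Python string handling, not the model itself):

```python
# An extractive QA model predicts a start/end span inside the context,
# so the answer is always a substring of the context. Illustrated with
# plain string search on the example from the snippets above.
context = "My name is Clara and I live in Berkeley."
answer = "Clara"

start = context.find(answer)
end = start + len(answer)
assert context[start:end] == answer
print(start, end)  # 11 16
```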
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_fin_RoBERTa_v1_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|fi|
|Size:|248.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/cgou/fin_RoBERTa-v1-finetuned-squad
---
layout: model
title: Mapping Drugs from the KEGG Database to Their Efficacies, Molecular Weights and Corresponding Codes from Other Databases
author: John Snow Labs
name: kegg_drug_mapper
date: 2022-11-21
tags: [drug, efficacy, molecular_weight, cas, pubchem, chebi, ligandbox, nikkaji, pdbcct, chunk_mapper, clinical, en, licensed]
task: Chunk Mapping
language: en
nav_key: models
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained model maps drugs to their corresponding `efficacy` and `molecular_weight`, as well as their `CAS`, `PubChem`, `ChEBI`, `LigandBox`, `NIKKAJI`, and `PDB-CCD` codes. This model was trained with data from the KEGG database.
## Predicted Entities
`efficacy`, `molecular_weight`, `CAS`, `PubChem`, `ChEBI`, `LigandBox`, `NIKKAJI`, `PDB-CCD`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/kegg_drug_mapper_en_4.2.2_3.0_1669069910375.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/kegg_drug_mapper_en_4.2.2_3.0_1669069910375.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
chunkerMapper = ChunkMapperModel.pretrained("kegg_drug_mapper", "en", "clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")\
.setRels(["efficacy", "molecular_weight", "CAS", "PubChem", "ChEBI", "LigandBox", "NIKKAJI", "PDB-CCD"])
pipeline = Pipeline().setStages([
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
converter,
chunkerMapper])
text= "She is given OxyContin, folic acid, levothyroxine, Norvasc, aspirin, Neurontin"
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val chunkerMapper = ChunkMapperModel.pretrained("kegg_drug_mapper", "en", "clinical/models")
.setInputCols("ner_chunk")
.setOutputCol("mappings")
.setRels(Array("efficacy", "molecular_weight", "CAS", "PubChem", "ChEBI", "LigandBox", "NIKKAJI", "PDB-CCD"))
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
converter,
chunkerMapper))
val text= "She is given OxyContin, folic acid, levothyroxine, Norvasc, aspirin, Neurontin"
val data = Seq(text).toDS.toDF("text")
val result= pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.kegg_drug").predict("""She is given OxyContin, folic acid, levothyroxine, Norvasc, aspirin, Neurontin""")
```
## Results
```bash
+-------------+--------------------------------------------------+----------------+----------+-----------+-------+---------+---------+-------+
| ner_chunk| efficacy|molecular_weight| CAS| PubChem| ChEBI|LigandBox| NIKKAJI|PDB-CCD|
+-------------+--------------------------------------------------+----------------+----------+-----------+-------+---------+---------+-------+
| OxyContin| Analgesic (narcotic), Opioid receptor agonist| 351.8246| 124-90-3| 7847912.0| 7859.0| D00847|J281.239H| NONE|
| folic acid|Anti-anemic, Hematopoietic, Supplement (folic a...| 441.3975| 59-30-3| 7847138.0|27470.0| D00070| J1.392G| FOL|
|levothyroxine| Replenisher (thyroid hormone)| 776.87| 51-48-9|9.6024815E7|18332.0| D08125| J4.118A| T44|
| Norvasc|Antihypertensive, Vasodilator, Calcium channel ...| 408.8759|88150-42-9|5.1091781E7| 2668.0| D07450| J33.383B| NONE|
| aspirin|Analgesic, Anti-inflammatory, Antipyretic, Anti...| 180.1574| 50-78-2| 7847177.0|15365.0| D00109| J2.300K| AIN|
| Neurontin| Anticonvulsant, Antiepileptic| 171.2368|60142-96-3| 7847398.0|42797.0| D00332| J39.388F| GBN|
+-------------+--------------------------------------------------+----------------+----------+-----------+-------+---------+---------+-------+
```
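Each mapped drug carries several relations at once, so picking one relation out of the mapper output is just a keyed lookup. A toy sketch restating two rows of the table above (illustrative values only, not the full KEGG data):

```python
# Two rows from the table above, as the mapper conceptually stores them.
# Illustrative subset only; the real model covers the KEGG drug database.
mappings = {
    "aspirin":   {"efficacy": "Analgesic, Anti-inflammatory, Antipyretic",
                  "CAS": "50-78-2", "PDB-CCD": "AIN"},
    "Neurontin": {"efficacy": "Anticonvulsant, Antiepileptic",
                  "CAS": "60142-96-3", "PDB-CCD": "GBN"},
}

def get_relation(drug, rel):
    # unknown drugs or relations fall back to "NONE", as in the output above
    return mappings.get(drug, {}).get(rel, "NONE")

print(get_relation("aspirin", "CAS"))  # 50-78-2
```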
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|kegg_drug_mapper|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|1.0 MB|
---
layout: model
title: Translate English to Haitian Creole Pipeline
author: John Snow Labs
name: translate_en_ht
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, ht, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `ht`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ht_xx_2.7.0_2.4_1609688301365.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ht_xx_2.7.0_2.4_1609688301365.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_ht", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_ht", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.ht').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_ht|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: German Bert Embeddings (Base, Cased, Old Vocabulary)
author: John Snow Labs
name: bert_embeddings_bert_base_german_cased_oldvocab
date: 2022-04-11
tags: [bert, embeddings, de, open_source]
task: Embeddings
language: de
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-german-cased-oldvocab` is a German model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_german_cased_oldvocab_de_3.4.2_3.0_1649676274361.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_german_cased_oldvocab_de_3.4.2_3.0_1649676274361.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_german_cased_oldvocab","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_german_cased_oldvocab","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ich liebe Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.embed.bert_base_german_cased_oldvocab").predict("""Ich liebe Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_german_cased_oldvocab|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|de|
|Size:|409.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/deepset/bert-base-german-cased-oldvocab
- https://github.com/deepset-ai/FARM/issues/60
- https://deepset.ai/german-bert
- https://deepset.ai/germanquad
- https://github.com/deepset-ai/FARM
- https://github.com/deepset-ai/haystack/
- https://twitter.com/deepset_ai
- https://www.linkedin.com/company/deepset-ai/
- https://haystack.deepset.ai/community/join
- https://github.com/deepset-ai/haystack/discussions
- https://deepset.ai
- http://www.deepset.ai/jobs
---
layout: model
title: Persian BertForQuestionAnswering model (from SajjadAyoubi)
author: John Snow Labs
name: bert_qa_bert_base_fa_qa
date: 2022-06-02
tags: [fa, open_source, question_answering, bert]
task: Question Answering
language: fa
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-fa-qa` is a Persian model originally trained by `SajjadAyoubi`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_fa_qa_fa_4.0.0_3.0_1654179918056.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_fa_qa_fa_4.0.0_3.0_1654179918056.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_fa_qa","fa") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_fa_qa","fa")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("fa.answer_question.bert.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_fa_qa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|fa|
|Size:|607.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/SajjadAyoubi/bert-base-fa-qa
- https://colab.research.google.com/github/sajjjadayobi/PersianQA/blob/main/notebooks/HowToUse.ipynb
---
layout: model
title: Translate English to East Slavic languages Pipeline
author: John Snow Labs
name: translate_en_zle
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, zle, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `zle`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_zle_xx_2.7.0_2.4_1609691744462.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_zle_xx_2.7.0_2.4_1609691744462.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_zle", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_zle", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.zle').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_zle|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Universal Sentence Encoder XLING Many
author: John Snow Labs
name: tfhub_use_xling_many
date: 2020-12-08
task: Embeddings
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
deprecated: true
tags: [embeddings, open_source, xx]
supported: true
annotator: UniversalSentenceEncoder
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The Universal Sentence Encoder Cross-lingual (XLING) module is an extension of the Universal Sentence Encoder that includes training on multiple tasks across languages. The multi-task training setup is based on the paper "Learning Cross-lingual Sentence Representations via a Multi-task Dual Encoder".
This specific module is trained on English, French, German, Spanish, Italian, Chinese, Korean, and Japanese tasks, and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs.
It is trained on a variety of data sources and tasks, with the goal of learning text representations that are useful out-of-the-box for a number of applications. The input to the module is variable length text in any of the eight aforementioned languages and the output is a 512 dimensional vector.
We note that one does not need to specify the language of the input, as the model was trained such that text across languages with similar meanings will have embeddings with high dot product scores.
Note: This model only works on Linux and macOS operating systems and is not compatible with Windows due to the incompatibility of the SentencePiece library.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/tfhub_use_xling_many_xx_2.7.0_2.4_1607440840968.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/tfhub_use_xling_many_xx_2.7.0_2.4_1607440840968.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_xling_many", "xx") \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I love NLP"], ["Me encanta usar SparkNLP"]], ["text"]))
```
```scala
val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_xling_many", "xx")
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I love NLP", "Me encanta usar SparkNLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP", "Me encanta usar SparkNLP"]
embeddings_df = nlu.load('xx.use.xling_many').predict(text, output_level='sentence')
embeddings_df
```
## Results
It gives a 512-dimensional vector for each sentence.
```bash
xx_use_xling_many_embeddings sentence
0 [0.03621278703212738, 0.007045685313642025, -0... I love NLP
1 [-0.0060035050846636295, 0.028749311342835426,... Me encanta usar SparkNLP
```
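Because the model embeds all eight languages in one space, semantically similar sentences in different languages should score a high dot product or cosine similarity. A minimal sketch of that comparison, with short hypothetical vectors standing in for the 512-dimensional embeddings above:

```python
import math

def cosine(a, b):
    # cosine similarity: normalized dot product of two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dim stand-ins for the 512-dim embeddings above.
en = [0.036, 0.007, -0.012, 0.051]   # "I love NLP"
es = [0.035, 0.009, -0.010, 0.049]   # "Me encanta usar SparkNLP"

sim = cosine(en, es)
print(round(sim, 3))  # close to 1.0 for similar meanings
```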
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|tfhub_use_xling_many|
|Compatibility:|Spark NLP 2.7.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|xx|
## Data Source
[https://tfhub.dev/google/universal-sentence-encoder-xling-many/1](https://tfhub.dev/google/universal-sentence-encoder-xling-many/1)
---
layout: model
title: Pipeline to Map MESH Codes to Their Corresponding UMLS Codes
author: John Snow Labs
name: mesh_umls_mapping
date: 2022-06-27
tags: [mesh, umls, chunk_mapper, pipeline, clinical, licensed, en]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the `mesh_umls_mapper` model.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapping_en_3.5.3_3.0_1656366727552.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapping_en_3.5.3_3.0_1656366727552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline= PretrainedPipeline("mesh_umls_mapping", "en", "clinical/models")
result = pipeline.fullAnnotate("C028491 D019326 C579867")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline= new PretrainedPipeline("mesh_umls_mapping", "en", "clinical/models")
val result = pipeline.fullAnnotate("C028491 D019326 C579867")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.mesh.umls").predict("""C028491 D019326 C579867""")
```
## Results
```bash
| | mesh_code | umls_code |
|---:|:----------|:----------|
| 0 | C028491 | C0043904 |
| 1 | D019326 | C0045010 |
| 2 | C579867 | C3696376 |
```
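The lookup performed by the pipeline's ChunkMapper stage can be sketched as a plain Python dictionary; the MESH-to-UMLS pairs below are read positionally from the result table above, and the `map_codes` helper is only an illustration, not the ChunkMapper API:

```python
# MESH -> UMLS pairs taken from the result table above (pairing is positional).
mesh_to_umls = {
    "C028491": "C0043904",
    "D019326": "C0045010",
    "C579867": "C3696376",
}

def map_codes(text):
    """Map each whitespace-separated MESH code to its UMLS code (None if unknown)."""
    return {code: mesh_to_umls.get(code) for code in text.split()}

print(map_codes("C028491 D019326 C579867"))
```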
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|mesh_umls_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|3.8 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- ChunkMapperModel
---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_nh16
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nh16` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nh16_en_4.3.0_3.0_1675123654269.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nh16_en_4.3.0_3.0_1675123654269.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_tiny_nh16","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_tiny_nh16","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_nh16|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|64.1 MB|
## References
- https://huggingface.co/google/t5-efficient-tiny-nh16
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Extract relations between phenotypic abnormalities and diseases (ReDL)
author: John Snow Labs
name: redl_human_phenotype_gene_biobert
date: 2021-07-24
tags: [relation_extraction, en, licensed, clinical]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 3.0.3
spark_version: 2.4
supported: true
annotator: RelationExtractionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Extract relations to fully understand the origin of some phenotypic abnormalities and their associated diseases. `1` : Entities are related, `0` : Entities are not related.
## Predicted Entities
`0`, `1`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_human_phenotype_gene_biobert_en_3.0.3_2.4_1627120647767.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_human_phenotype_gene_biobert_en_3.0.3_2.4_1627120647767.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = sparknlp.annotators.Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
words_embedder = WordEmbeddingsModel() \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"]) \
.setOutputCol("embeddings")
ner_tagger = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_converter = NerConverter() \
.setInputCols(["sentences", "tokens", "ner_tags"]) \
.setOutputCol("ner_chunks")
dependency_parser = DependencyParserModel() \
.pretrained("dependency_conllu", "en") \
.setInputCols(["sentences", "pos_tags", "tokens"]) \
.setOutputCol("dependencies")
# Set a filter on pairs of named entities which will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter() \
.setInputCols(["ner_chunks", "dependencies"])\
.setMaxSyntacticDistance(10)\
.setOutputCol("re_ner_chunks")
# The dataset this model was trained on is sentence-wise.
# The model can also be trained on document-level relations; in that case, use "document" instead of "sentence" as input when predicting.
re_model = RelationExtractionDLModel()\
.pretrained('redl_human_phenotype_gene_biobert', 'en', "clinical/models") \
.setPredictionThreshold(0.5)\
.setInputCols(["re_ner_chunks", "sentences"]) \
.setOutputCol("relations")
pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model])
text = """She has a retinal degeneration, hearing loss and renal failure, short stature, Mutations in the SH3PXD2B gene coding for the Tks4 protein are responsible for the autosomal recessive."""
data = spark.createDataFrame([[text]]).toDF("text")
p_model = pipeline.fit(data)
result = p_model.transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val pos_tagger = PerceptronModel()
.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val words_embedder = WordEmbeddingsModel()
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val ner_tagger = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_converter = new NerConverter()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel()
.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
// Set a filter on pairs of named entities which will be treated as relation candidates
val re_ner_chunk_filter = RENerChunksFilter()
.setInputCols(Array("ner_chunks", "dependencies"))
.setMaxSyntacticDistance(10)
.setOutputCol("re_ner_chunks")
// The dataset this model was trained on is sentence-wise.
// The model can also be trained on document-level relations; in that case, use "document" instead of "sentence" as input when predicting.
val re_model = RelationExtractionDLModel()
.pretrained("redl_human_phenotype_gene_biobert", "en", "clinical/models")
.setPredictionThreshold(0.5)
.setInputCols(Array("re_ner_chunks", "sentences"))
.setOutputCol("relations")
val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model))
val data = Seq("""She has a retinal degeneration, hearing loss and renal failure, short stature, Mutations in the SH3PXD2B gene coding for the Tks4 protein are responsible for the autosomal recessive.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.humen_phenotype_gene").predict("""She has a retinal degeneration, hearing loss and renal failure, short stature, Mutations in the SH3PXD2B gene coding for the Tks4 protein are responsible for the autosomal recessive.""")
```
## Results
```bash
| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---:|-----------:|:----------|----------------:|--------------:|:---------------------|:----------|----------------:|--------------:|:--------------------|-------------:|
| 0 | 0 | HP | 10 | 29 | retinal degeneration | HP | 32 | 43 | hearing loss | 0.893809 |
| 1 | 0 | HP | 10 | 29 | retinal degeneration | HP | 49 | 61 | renal failure | 0.958486 |
| 2 | 1 | HP | 10 | 29 | retinal degeneration | HP | 162 | 180 | autosomal recessive | 0.65584 |
| 3 | 0 | HP | 32 | 43 | hearing loss | HP | 64 | 76 | short stature | 0.707055 |
| 4 | 1 | HP | 32 | 43 | hearing loss | GENE | 96 | 103 | SH3PXD2B | 0.640802 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|redl_human_phenotype_gene_biobert|
|Compatibility:|Healthcare NLP 3.0.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Case sensitive:|true|
## Data Source
Trained on a silver standard corpus of human phenotype and gene annotations and their relations.
## Benchmarking
```bash
Relation Recall Precision F1 Support
0 0.922 0.908 0.915 129
1 0.831 0.855 0.843 71
Avg. 0.877 0.882 0.879 -
```
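The `Avg.` row above is the unweighted (macro) mean of the per-relation scores; a minimal sketch that reproduces it from the table values (the `macro_avg` helper is illustrative only):

```python
# Per-relation scores from the benchmark table above, as (recall, precision, f1).
scores = {
    "0": (0.922, 0.908, 0.915),
    "1": (0.831, 0.855, 0.843),
}

def macro_avg(scores):
    """Unweighted mean of each metric across relation labels (the Avg. row)."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))

print(macro_avg(scores))  # close to the table's Avg. row: 0.877, 0.882, 0.879
```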
---
layout: model
title: English image_classifier_vit_exper3_mesum5 ViTForImageClassification from sudo-s
author: John Snow Labs
name: image_classifier_vit_exper3_mesum5
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_exper3_mesum5` is an English model originally trained by sudo-s.
## Predicted Entities
`45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper3_mesum5_en_4.1.0_3.0_1660167974762.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper3_mesum5_en_4.1.0_3.0_1660167974762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_exper3_mesum5", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_exper3_mesum5", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_exper3_mesum5|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|322.3 MB|
---
layout: model
title: Translate Bulgarian to English Pipeline
author: John Snow Labs
name: translate_bg_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, bg, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this module is very computationally expensive, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `bg`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_bg_en_xx_2.7.0_2.4_1609691570462.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_bg_en_xx_2.7.0_2.4_1609691570462.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_bg_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_bg_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.bg.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_bg_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Financial SEC Filings Classifier
author: John Snow Labs
name: finclf_sec_filings
date: 2022-12-01
tags: [en, finance, classification, sec, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: FinanceClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model classifies documents into one of the following US Securities and Exchange Commission (SEC) filing types: `10-K`, `10-Q`, `8-K`, `S-8`, `3`, `4`, `Other`.
**IMPORTANT**: This model works with the first 512 tokens of a document; you do not need to run it on the whole document.
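Because only the first 512 tokens are used, long documents can be truncated before classification. A minimal sketch using plain whitespace tokenization (Spark NLP's own tokenizer may split text differently; `truncate_tokens` is an illustrative helper, not part of the library):

```python
def truncate_tokens(text, max_tokens=512):
    """Keep only the first max_tokens whitespace-separated tokens of a document."""
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

doc = "word " * 1000  # a long document
short = truncate_tokens(doc)
print(len(short.split()))  # 512
```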
## Predicted Entities
`10-K`, `10-Q`, `8-K`, `S-8`, `3`, `4`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_sec_filings_en_1.0.0_3.0_1669921534523.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_sec_filings_en_1.0.0_3.0_1669921534523.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
model = WordEmbeddingsModel.pretrained("embeddings_scielo_150d","es","clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("word_embeddings")
```
```scala
val model = WordEmbeddingsModel.pretrained("embeddings_scielo_150d","es","clinical/models")
.setInputCols("document","token")
.setOutputCol("word_embeddings")
```
{:.nlu-block}
```python
import nlu
nlu.load("es.embed.scielo.150d").predict("""Put your text here.""")
```
{:.h2_title}
## Results
Word2Vec feature vectors based on ``embeddings_scielo_150d``.
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|embeddings_scielo_150d|
|Type:|WordEmbeddingsModel|
|Compatibility:|Spark NLP 2.5.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[word_embeddings]|
|Language:|es|
|Dimension:|150.0|
{:.h2_title}
## Data Source
Trained on Scielo Articles
[https://zenodo.org/record/3744326#.XtViinVKh_U](https://zenodo.org/record/3744326#.XtViinVKh_U)
---
layout: model
title: Pipeline to Detect Organism in Medical Text
author: John Snow Labs
name: bert_token_classifier_ner_species_pipeline
date: 2023-03-20
tags: [en, ner, clinical, licensed, bertfortokenclassification]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [bert_token_classifier_ner_species](https://nlp.johnsnowlabs.com/2022/07/25/bert_token_classifier_ner_species_en_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_species_pipeline_en_4.3.0_3.2_1679301125473.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_species_pipeline_en_4.3.0_3.2_1679301125473.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_ner_species_pipeline", "en", "clinical/models")
text = '''As determined by 16S rRNA gene sequence analysis, strain 6C (T) represents a distinct species belonging to the class Betaproteobacteria and is most closely related to Thiomonas intermedia DSM 18155 (T) and Thiomonas perometabolis DSM 18570 (T) .'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_ner_species_pipeline", "en", "clinical/models")
val text = "As determined by 16S rRNA gene sequence analysis, strain 6C (T) represents a distinct species belonging to the class Betaproteobacteria and is most closely related to Thiomonas intermedia DSM 18155 (T) and Thiomonas perometabolis DSM 18570 (T) ."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:------------------------|--------:|------:|:------------|-------------:|
| 0 | 6C (T) | 57 | 62 | SPECIES | 0.998955 |
| 1 | Betaproteobacteria | 117 | 134 | SPECIES | 0.99973 |
| 2 | Thiomonas intermedia | 167 | 186 | SPECIES | 0.999822 |
| 3 | DSM 18155 (T) | 188 | 200 | SPECIES | 0.997657 |
| 4 | Thiomonas perometabolis | 206 | 228 | SPECIES | 0.999614 |
| 5 | DSM 18570 (T) | 230 | 242 | SPECIES | 0.997146 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_species_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.7 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverterInternalModel
---
layout: model
title: Multilingual DistilBertForQuestionAnswering Cased model (from ZYW)
author: John Snow Labs
name: distilbert_qa_zyw_model
date: 2023-01-03
tags: [xx, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: xx
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `en-de-vi-zh-es-model` is a Multilingual model originally trained by `ZYW`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_zyw_model_xx_4.3.0_3.0_1672775050259.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_zyw_model_xx_4.3.0_3.0_1672775050259.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_zyw_model","xx")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_zyw_model","xx")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_zyw_model|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|xx|
|Size:|505.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ZYW/en-de-vi-zh-es-model
---
layout: model
title: Extract Clinical Department Entities from Voice of the Patient Documents (embeddings_clinical_large)
author: John Snow Labs
name: ner_vop_clinical_dept_emb_clinical_large
date: 2023-06-06
tags: [licensed, clinical, en, ner, vop]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts mentions of clinical departments and medical devices from documents written in the patient's own words.
## Predicted Entities
`ClinicalDept`, `AdmissionDischarge`, `MedicalDevice`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_emb_clinical_large_en_4.4.3_3.0_1686074681308.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_emb_clinical_large_en_4.4.3_3.0_1686074681308.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_vop_clinical_dept_emb_clinical_large", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. Wishing him a speedy recovery!"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_vop_clinical_dept_emb_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. Wishing him a speedy recovery!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| chunk | ner_label |
|:----------------------|:--------------|
| orthopedic department | ClinicalDept |
| titanium plate | MedicalDevice |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_clinical_dept_emb_clinical_large|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.8 MB|
|Dependencies:|embeddings_clinical_large|
## References
In-house annotated health-related text in colloquial language.
## Sample text from the training dataset
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
## Benchmarking
```bash
label tp fp fn total precision recall f1
ClinicalDept 297 35 29 326 0.89 0.91 0.90
AdmissionDischarge 25 0 9 34 1.00 0.74 0.85
MedicalDevice 256 64 76 332 0.80 0.77 0.79
macro_avg 578 99 114 692 0.90 0.81 0.85
micro_avg 578 99 114 692 0.85 0.83 0.84
```
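The precision, recall, and F1 values above follow directly from the tp/fp/fn counts; a small sketch that reproduces the `ClinicalDept` row of the table (the `prf` helper is illustrative only):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# ClinicalDept counts from the benchmark table above
p, r, f1 = prf(tp=297, fp=35, fn=29)
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # 0.89 0.91 0.90
```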
---
layout: model
title: Legal Eu Finance Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_eu_finance_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, eu_finance, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
The `legclf_eu_finance_bert` model is a Bert Sentence Embeddings Document Classifier that determines whether a document belongs to the `Eu_Finance` class or not (binary classification), according to EuroVoc labels.
## Predicted Entities
`Eu_Finance`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_eu_finance_bert_en_1.0.0_3.0_1678111884579.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_eu_finance_bert_en_1.0.0_3.0_1678111884579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+------------+
|      result|
+------------+
|[Eu_Finance]|
|     [Other]|
|     [Other]|
|[Eu_Finance]|
+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_eu_finance_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.8 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Eu_Finance 0.89 0.87 0.88 622
Other 0.85 0.87 0.86 529
accuracy - - 0.87 1151
macro-avg 0.87 0.87 0.87 1151
weighted-avg 0.87 0.87 0.87 1151
```
---
layout: model
title: Chinese BertForMaskedLM Cased model (from hfl)
author: John Snow Labs
name: bert_embeddings_rbt4
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rbt4` is a Chinese model originally trained by `hfl`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt4_zh_4.2.4_3.0_1670327086297.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt4_zh_4.2.4_3.0_1670327086297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt4","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt4","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_rbt4|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|171.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/hfl/rbt4
- https://arxiv.org/abs/1906.08101
- https://github.com/google-research/bert
- https://github.com/ymcui/Chinese-BERT-wwm
- https://github.com/ymcui/MacBERT
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/ymcui/HFL-Anthology
- https://arxiv.org/abs/2004.13922
- https://arxiv.org/abs/1906.08101
---
layout: model
title: Detect Drug Chemicals
author: John Snow Labs
name: ner_drugs_large_en
date: 2021-01-29
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP for Healthcare 2.7.1
spark_version: 2.4
tags: [ner, en, licensed, clinical]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Pretrained named entity recognition deep learning model for drugs. The model combines dosage, strength, form, and route into a single entity: Drug. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
{:.h2_title}
## Predicted Entities
`DRUG`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugs_large_en_2.6.0_2.4_1603915964112.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugs_large_en_2.6.0_2.4_1603915964112.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter at the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
# Clinical word embeddings trained on the PubMed dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_drugs_large", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
data = spark.createDataFrame([["""The patient is a 40-year-old white male who presents with a chief complaint of 'chest pain'. The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. He has been advised Aspirin 81 milligrams QDay. Humulin N. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually PRN chest pain."""]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
// Clinical word embeddings trained on the PubMed dataset
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = NerDLModel.pretrained("ner_drugs_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))
val data = Seq("""The patient is a 40-year-old white male who presents with a chief complaint of 'chest pain'. The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. He has been advised Aspirin 81 milligrams QDay. Humulin N. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually PRN chest pain.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata.
```bash
+--------------------------------+---------+
|chunk |ner_label|
+--------------------------------+---------+
|Aspirin 81 milligrams |DRUG |
|Humulin N |DRUG |
|insulin 50 units |DRUG |
|HCTZ 50 mg |DRUG |
|Nitroglycerin 1/150 sublingually|DRUG |
+--------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_drugs_large|
|Type:|ner|
|Compatibility:|Spark NLP 2.7.1+|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Case sensitive:|false|
{:.h2_title}
## Data Source
Trained on i2b2_med7 + FDA with 'embeddings_clinical'.
https://www.i2b2.org/NLP/Medication
{:.h2_title}
## Benchmarking
Since this NER model is derived from `ner_posology` but reduced to a single entity, no separate benchmark is applicable.
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265908
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265908` is an English model originally trained by `teacookies`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265908_en_4.0.0_3.0_1655985559166.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265908_en_4.0.0_3.0_1655985559166.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265908","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265908","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265908").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265908|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|888.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265908
---
layout: model
title: English image_classifier_vit_pond_image_classification_7 ViTForImageClassification from SummerChiam
author: John Snow Labs
name: image_classifier_vit_pond_image_classification_7
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pond_image_classification_7` is an English model originally trained by SummerChiam.
## Predicted Entities
`Normal`, `Boiling`, `Algae`, `NormalCement`, `NormalRain`, `BoilingNight`, `NormalNight`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_7_en_4.1.0_3.0_1660167146003.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_7_en_4.1.0_3.0_1660167146003.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_pond_image_classification_7", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_pond_image_classification_7", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_pond_image_classification_7|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Earning Calls Financial NER (Specific, md)
author: John Snow Labs
name: finner_earning_calls_specific_md
date: 2022-12-15
tags: [en, finance, ner, licensed, earning, calls]
task: Named Entity Recognition
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a `md` (medium) version of a financial model trained on Earning Calls transcripts to detect financial entities (NER model).
This model is called `Specific` because it has more labels than the `Generic` version.
Please note this model requires some tokenization configuration to extract the currency (see the Python snippet below).
The currently available entities are:
- AMOUNT: Numeric amounts, not percentages
- ASSET: Current or Fixed Asset
- ASSET_DECREASE: Decrease in the asset possession/exposure
- ASSET_INCREASE: Increase in the asset possession/exposure
- CF: Total cash flow
- CFO: Cash flow from operating activity
- CFO_INCREASE: Cash flow from operating activity increased
- CONTRA_LIABILITY: Negative liability account that offsets the liability account (e.g. paying a debt)
- COUNT: Number of items (not monetary, not percentages).
- CURRENCY: The currency of the amount
- DATE: Generic dates in context where either it's not a fiscal year or it can't be asserted as such given the context
- EXPENSE: An expense or loss
- EXPENSE_DECREASE: A piece of information saying there was an expense decrease in that fiscal year
- EXPENSE_INCREASE: A piece of information saying there was an expense increase in that fiscal year
- FCF: Free Cash Flow
- FISCAL_YEAR: A date which expresses which month the fiscal exercise was closed for a specific year
- INCOME: Any income that is reported
- INCOME_INCREASE: Relative increase in income
- KPI: Key Performance Indicator, a quantifiable measure of performance over time for a specific objective
- KPI_DECREASE: Relative decrease in a KPI
- KPI_INCREASE: Relative increase in a KPI
- LIABILITY: Current or Long-Term Liability (not from stockholders)
- LIABILITY_DECREASE: Relative decrease in liability
- LIABILITY_INCREASE: Relative increase in liability
- LOSS: Type of loss (e.g. gross, net)
- ORG: Mention to a company/organization name
- PERCENTAGE: Numeric amounts which are percentages
- PROFIT: Profit or also Revenue
- PROFIT_DECLINE: A piece of information saying there was a profit / revenue decrease in that fiscal year
- PROFIT_INCREASE: A piece of information saying there was a profit / revenue increase in that fiscal year
- REVENUE: Revenue reported by company
- REVENUE_DECLINE: Relative decrease in revenue when compared to other years
- REVENUE_INCREASE: Relative increase in revenue when compared to other years
- STOCKHOLDERS_EQUITY: Equity possessed by stockholders, not liability
- TICKER: Trading symbol of the company
## Predicted Entities
`AMOUNT`, `ASSET`, `ASSET_DECREASE`, `ASSET_INCREASE`, `CF`, `CFO`, `CFO_INCREASE`, `CF_INCREASE`, `CONTRA_LIABILITY`, `COUNT`, `CURRENCY`, `DATE`, `EXPENSE`, `EXPENSE_DECREASE`, `EXPENSE_INCREASE`, `FCF`, `FISCAL_YEAR`, `INCOME`, `INCOME_INCREASE`, `KPI`, `KPI_DECREASE`, `KPI_INCREASE`, `LIABILITY`, `LIABILITY_DECREASE`, `LIABILITY_INCREASE`, `LOSS`, `LOSS_DECREASE`, `ORG`, `PERCENTAGE`, `PROFIT`, `PROFIT_DECLINE`, `PROFIT_INCREASE`, `REVENUE`, `REVENUE_DECLINE`, `REVENUE_INCREASE`, `STOCKHOLDERS_EQUITY`, `TICKER`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_earning_calls_specific_md_en_1.0.0_3.0_1671134641020.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_earning_calls_specific_md_en_1.0.0_3.0_1671134641020.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")\
.setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€'])
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)
ner_model = finance.NerModel.pretrained("finner_earning_calls_specific_md", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter
])
data = spark.createDataFrame([["""Adjusted EPS was ahead of our expectations at $ 1.21 , and free cash flow is also ahead of our expectations despite a $ 1.5 billion additional tax payment we made related to the R&D amortization."""]]).toDF("text")
model = pipeline.fit(data)
result = model.transform(data)
import pyspark.sql.functions as F

result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
.select(F.expr("cols['0']").alias("text"),
F.expr("cols['1']['entity']").alias("label")).show(200, truncate = False)
```
## Results
```bash
+------------+----------+----------+
| token| ner_label|confidence|
+------------+----------+----------+
| Adjusted| B-PROFIT| 0.6179|
| EPS| I-PROFIT| 0.913|
| was| O| 1.0|
| ahead| O| 1.0|
| of| O| 1.0|
| our| O| 1.0|
|expectations| O| 1.0|
| at| O| 1.0|
| $|B-CURRENCY| 1.0|
| 1.21| B-AMOUNT| 1.0|
| ,| O| 1.0|
| and| O| 1.0|
| free| B-FCF| 0.9992|
| cash| I-FCF| 0.9945|
| flow| I-FCF| 0.9988|
| is| O| 1.0|
| also| O| 1.0|
| ahead| O| 1.0|
| of| O| 1.0|
| our| O| 1.0|
|expectations| O| 1.0|
| despite| O| 1.0|
| a| O| 1.0|
| $|B-CURRENCY| 1.0|
| 1.5| B-AMOUNT| 1.0|
| billion| I-AMOUNT| 1.0|
| additional| O| 0.9945|
| tax| O| 0.6131|
| payment| O| 0.6613|
| we| O| 1.0|
| made| O| 1.0|
| related| O| 1.0|
| to| O| 1.0|
| the| O| 1.0|
| R&D| O| 0.9994|
|amortization| O| 0.9989|
| .| O| 1.0|
+------------+----------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finner_earning_calls_specific_md|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|16.3 MB|
## References
In-house annotations on Earning Calls.
## Benchmarking
```bash
label precision recall f1 support
AMOUNT 99.303136 99.650350 99.476440 574
ASSET 55.172414 47.058824 50.793651 29
ASSET_INCREASE 100.000000 33.333333 50.000000 1
CF 46.153846 70.588235 55.813953 26
CFO 77.777778 100.000000 87.500000 9
CONTRA_LIABILITY 52.380952 56.410256 54.320988 42
COUNT 65.384615 77.272727 70.833333 26
CURRENCY 98.916968 99.636364 99.275362 554
DATE 86.982249 93.630573 90.184049 169
EXPENSE 67.187500 57.333333 61.870504 64
EXPENSE_DECREASE 100.000000 60.000000 75.000000 3
EXPENSE_INCREASE 40.000000 44.444444 42.105263 10
FCF 75.000000 75.000000 75.000000 20
INCOME 60.000000 40.000000 48.000000 10
KPI 41.666667 23.809524 30.303030 12
KPI_DECREASE 20.000000 10.000000 13.333333 5
KPI_INCREASE 44.444444 38.095238 41.025641 18
LIABILITY 38.461538 38.461538 38.461538 13
LIABILITY_DECREASE 50.000000 66.666667 57.142857 4
LOSS 50.000000 37.500000 42.857143 6
ORG 94.736842 90.000000 92.307692 19
PERCENTAGE 99.299475 99.648506 99.473684 571
PROFIT 78.014184 85.937500 81.784387 141
PROFIT_DECLINE 100.000000 36.363636 53.333333 4
PROFIT_INCREASE 78.947368 75.000000 76.923077 19
REVENUE 64.835165 71.951220 68.208092 91
REVENUE_DECLINE 53.571429 57.692308 55.555556 28
REVENUE_INCREASE 65.734266 75.200000 70.149254 143
STOCKHOLDERS_EQUITY 60.000000 37.500000 46.153846 5
TICKER 94.444444 94.444444 94.444444 18
accuracy - - 0.9571 19083
macro-avg 0.6660 0.5900 0.6070 19083
weighted-avg 0.9575 0.9571 0.9563 19083
```
---
layout: model
title: Legal Note Purchase Agreement Document Classifier (Longformer)
author: John Snow Labs
name: legclf_note_purchase_agreement
date: 2022-11-24
tags: [en, legal, classification, agreement, note_purchase, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_note_purchase_agreement` model is a Legal Longformer Document Classifier that classifies whether a document belongs to the class `note-purchase-agreement` or not (binary classification).
Longformers are limited to 4096 tokens, so only the first 4096 tokens are taken into account. For the large majority of documents in legal corpora, provided they are clean and contain only the legal document without any extra leading material, 4096 tokens are enough to perform document classification.
If that is not the case for your documents, let us know and we can apply another approach: splitting the document into chunks of 4096 tokens and averaging their embeddings, then training on the averaged version so that the whole document is taken into account. In practice this should rarely be required.
## Predicted Entities
`note-purchase-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_note_purchase_agreement_en_1.0.0_3.0_1669292848323.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_note_purchase_agreement_en_1.0.0_3.0_1669292848323.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
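This card also ships without a snippet; below is a minimal sketch of the Longformer-based document-classification pipeline the description implies. The embeddings model name (`legal_longformer_base`), the `nlp`/`legal` module aliases, and the `category` output column are assumptions based on sibling cards.

```python
# Hypothetical pipeline sketch: Longformer token embeddings are averaged into one
# document-level vector, then fed to the legal document classifier.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

# Embeddings model name is an assumption, not confirmed by this card.
embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

sentence_embeddings = nlp.SentenceEmbeddings()\
    .setInputCols(["document", "embeddings"])\
    .setOutputCol("sentence_embeddings")\
    .setPoolingStrategy("AVERAGE")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_note_purchase_agreement", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, tokenizer, embeddings,
                                sentence_embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```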
## Results
```bash
+-------------------------+
|result                   |
+-------------------------+
|[note-purchase-agreement]|
|[other]                  |
|[other]                  |
|[note-purchase-agreement]|
+-------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_note_purchase_agreement|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.4 MB|
## References
Legal documents, scraped from the Internet and classified in-house, plus SEC documents.
## Benchmarking
```bash
label precision recall f1-score support
note-purchase-agreement 0.90 0.92 0.91 38
other 0.97 0.96 0.96 90
accuracy - - 0.95 128
macro-avg 0.93 0.94 0.93 128
weighted-avg 0.95 0.95 0.95 128
```
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from holtin)
author: John Snow Labs
name: distilbert_qa_base_uncased_holtin_finetuned_full_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-holtin-finetuned-full-squad` is an English model originally trained by `holtin`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_holtin_finetuned_full_squad_en_4.3.0_3.0_1672773896128.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_holtin_finetuned_full_squad_en_4.3.0_3.0_1672773896128.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_holtin_finetuned_full_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_holtin_finetuned_full_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_holtin_finetuned_full_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/holtin/distilbert-base-uncased-holtin-finetuned-full-squad
---
layout: model
title: Embeddings Sciwiki 150 dims
author: John Snow Labs
name: embeddings_sciwiki_150d
class: WordEmbeddingsModel
language: es
repository: clinical/models
date: 2020-05-27
task: Embeddings
edition: Healthcare NLP 2.5.0
spark_version: 2.4
tags: [clinical,embeddings,es]
supported: true
annotator: WordEmbeddingsModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_sciwiki_150d_es_2.5.0_2.4_1590609340084.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_sciwiki_150d_es_2.5.0_2.4_1590609340084.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
model = WordEmbeddingsModel.pretrained("embeddings_sciwiki_150d","es","clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("word_embeddings")
```
```scala
val model = WordEmbeddingsModel.pretrained("embeddings_sciwiki_150d","es","clinical/models")
.setInputCols("document","token")
.setOutputCol("word_embeddings")
```
{:.nlu-block}
```python
import nlu
nlu.load("es.embed.sciwiki.150d").predict("""Put your text here.""")
```
{:.h2_title}
## Results
Word2Vec feature vectors based on ``embeddings_sciwiki_150d``.
{:.model-param}
## Model Information
{:.table-model}
|---------------|-------------------------|
| Name: | embeddings_sciwiki_150d |
| Type: | WordEmbeddingsModel |
| Compatibility: | Spark NLP 2.5.0+ |
| License: | Licensed |
| Edition: | Official |
|Input labels: | [document, token] |
|Output labels: | [word_embeddings] |
| Language: | es |
| Dimension: | 150 |
{:.h2_title}
## Data Source
Trained on Clinical Wikipedia Articles
https://zenodo.org/record/3744326#.XtViinVKh_U
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_6_h_128
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-6_H-128` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_128_zh_4.2.4_3.0_1670021708476.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_128_zh_4.2.4_3.0_1670021708476.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_128","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_128","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_6_h_128|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|15.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-6_H-128
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://cloud.tencent.com/
---
layout: model
title: Sentence Embeddings - sbert mini (tuned)
author: John Snow Labs
name: sbert_jsl_mini_umls_uncased
date: 2021-05-14
tags: [embeddings, clinical, licensed, en]
task: Embeddings
language: en
nav_key: models
edition: Healthcare NLP 3.0.3
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained to generate contextual sentence embeddings of input sentences.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_mini_umls_uncased_en_3.0.3_2.4_1621017142607.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_mini_umls_uncased_en_3.0.3_2.4_1621017142607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sbiobert_embeddings = BertSentenceEmbeddings\
.pretrained("sbert_jsl_mini_umls_uncased","en","clinical/models")\
.setInputCols(["sentence"])\
.setOutputCol("sbert_embeddings")
```
```scala
val sbiobert_embeddings = BertSentenceEmbeddings
.pretrained("sbert_jsl_mini_umls_uncased","en","clinical/models")
.setInputCols(Array("sentence"))
.setOutputCol("sbert_embeddings")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed_sentence.bert.jsl_mini_umlsuncased").predict("""Put your text here.""")
```
## Results
```bash
Gives a 768 dimensional vector representation of the sentence.
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbert_jsl_mini_umls_uncased|
|Compatibility:|Healthcare NLP 3.0.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Case sensitive:|false|
## Data Source
Tuned on the MedNLI and UMLS datasets.
## Benchmarking
```bash
MedNLI Score
Acc 0.677
STS(cos) 0.681
```
---
layout: model
title: Legal Custodian Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_custodian_agreement_bert
date: 2022-11-24
tags: [en, legal, classification, agreement, custodian, licensed, bert]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_custodian_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the `custodian-agreement` class or not (binary classification).
Compared with the Longformer-based alternative, this model is lighter and faster at inference.
## Predicted Entities
`custodian-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_custodian_agreement_bert_en_1.0.0_3.0_1669310583748.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_custodian_agreement_bert_en_1.0.0_3.0_1669310583748.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
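This card ships without a usage snippet. The sketch below shows how a Bert-sentence-embeddings document classifier of this kind is typically wired up, modeled on analogous John Snow Labs legal classifier cards; the upstream embeddings model name (`sent_bert_base_cased`) and the `legal.ClassifierDLModel` entry point are assumptions, not taken from this card.

```python
# Sketch only: assumes the johnsnowlabs library with a licensed Legal NLP installation
# and an active `spark` session.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed upstream sentence embeddings; the card itself does not name them.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_custodian_agreement_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR LEGAL DOCUMENT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```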
## Results
```bash
+---------------------+
|result               |
+---------------------+
|[custodian-agreement]|
|[other]              |
|[other]              |
|[custodian-agreement]|
+---------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_custodian_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents.
## Benchmarking
```bash
label precision recall f1-score support
custodian-agreement 0.98 0.93 0.95 43
other 0.96 0.99 0.98 82
accuracy - - 0.97 125
macro-avg 0.97 0.96 0.96 125
weighted-avg 0.97 0.97 0.97 125
```
---
layout: model
title: Italian DistilBertForMaskedLM Cased model (from indigo-ai)
author: John Snow Labs
name: distilbert_embeddings_bertino
date: 2022-12-12
tags: [it, open_source, distilbert_embeddings, distilbertformaskedlm]
task: Embeddings
language: it
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BERTino` is an Italian model originally trained by `indigo-ai`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_bertino_it_4.2.4_3.0_1670864710883.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_bertino_it_4.2.4_3.0_1670864710883.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_bertino","it") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(False)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, distilbert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_bertino","it")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(false)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, distilbert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_embeddings_bertino|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|it|
|Size:|253.3 MB|
|Case sensitive:|false|
## References
- https://huggingface.co/indigo-ai/BERTino
- https://indigo.ai/en/
- https://www.corpusitaliano.it/
- https://corpora.dipintra.it/public/run.cgi/corp_info?corpname=itwac_full
- https://universaldependencies.org/treebanks/it_partut/index.html
- https://universaldependencies.org/treebanks/it_isdt/index.html
- https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
---
layout: model
title: Detect PHI for Deidentification (Augmented)
author: John Snow Labs
name: ner_deid_augmented
date: 2021-01-20
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP for Healthcare 2.7.1
spark_version: 2.4
tags: [en, deidentify, ner, clinical, licensed]
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Deidentification NER (Augmented) is a Named Entity Recognition model that annotates text to find protected health information that may need to be deidentified.
We adhered to the official annotation guideline (AG) of the 2014 i2b2 Deid challenge while annotating new datasets for this model. All details regarding the nuances of and explanations for the AG can be found at [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/).
## Predicted Entities
`AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/3de6f25c23cd487d829ac3ce444ef19cfbe02631/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentificiation.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_augmented_en_2.7.1_2.4_1611145829422.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_augmented_en_2.7.1_2.4_1611145829422.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
This model is trained with the `embeddings_clinical` word embeddings, so be sure to use the same embeddings in the pipeline, along with a document assembler, sentence detector, tokenizer, and NER converter.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = NerDLModel.pretrained("ner_deid_augmented","en","clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol('ner_chunk')
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([['HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. ']], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("ner_deid_augmented","en","clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val nlpPipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter))
val model = nlpPipeline.fit(Seq.empty[String].toDS.toDF("text"))
val results = LightPipeline(model).fullAnnotate("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.""")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.deid.augmented").predict("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. """)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_super_1","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_super_1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_super_1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/nbroad/rob-base-superqa1
- https://paperswithcode.com/sota?task=Question+Answering&dataset=adversarial_qa
---
layout: model
title: Fast Neural Machine Translation Model from Pedi to English
author: John Snow Labs
name: opus_mt_nso_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, nso, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `nso`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_nso_en_xx_2.7.0_2.4_1609170845142.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_nso_en_xx_2.7.0_2.4_1609170845142.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_nso_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_nso_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.nso.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_nso_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Finance Capital Call Notices Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: finclf_capital_call_notices
date: 2023-02-16
tags: [en, licensed, finance, capital_calls, classification, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `finclf_capital_call_notices` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the `capital_call_notices` class or not (binary classification).
## Predicted Entities
`capital_call_notices`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_capital_call_notices_en_1.0.0_3.0_1676590287518.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_capital_call_notices_en_1.0.0_3.0_1676590287518.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
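No usage snippet is provided on this card. As a minimal sketch, a classifier of this type is normally assembled as below, following analogous John Snow Labs finance classifier cards; the embeddings model name (`sent_bert_base_cased`) and the `finance.ClassifierDLModel` entry point are assumptions rather than details from this card.

```python
# Sketch only: assumes the johnsnowlabs library with a licensed Finance NLP installation
# and an active `spark` session.
from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed upstream sentence embeddings; the card itself does not name them.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = finance.ClassifierDLModel.pretrained("finclf_capital_call_notices", "en", "finance/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR FINANCIAL DOCUMENT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```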
## Results
```bash
+----------------------+
|result                |
+----------------------+
|[capital_call_notices]|
|[other]               |
|[other]               |
|[capital_call_notices]|
+----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finclf_capital_call_notices|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|
## References
Financial documents classified in-house, plus SEC documents.
## Benchmarking
```bash
label precision recall f1-score support
capital_call_notices 1.00 1.00 1.00 12
other 1.00 1.00 1.00 23
accuracy - - 1.00 35
macro-avg 1.00 1.00 1.00 35
weighted-avg 1.00 1.00 1.00 35
```
---
layout: model
title: Sentiment Analysis of IMDB Reviews Pipeline (analyze_sentimentdl_glove_imdb)
author: John Snow Labs
name: analyze_sentimentdl_glove_imdb
date: 2021-01-15
task: [Embeddings, Sentiment Analysis, Pipeline Public]
language: en
nav_key: models
edition: Spark NLP 2.7.1
spark_version: 2.4
tags: [sentiment, en, pipeline]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A pre-trained pipeline to classify IMDB reviews into `neg` and `pos` classes using `glove_100d` embeddings.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/analyze_sentimentdl_glove_imdb_en_2.7.1_2.4_1610722058784.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/analyze_sentimentdl_glove_imdb_en_2.7.1_2.4_1610722058784.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("analyze_sentimentdl_glove_imdb", lang = "en")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("analyze_sentimentdl_glove_imdb", lang = "en")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.sentiment.glove").predict("""Put your text here.""")
```
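Once loaded, the pipeline can be applied straight to raw text with `annotate`. The sketch below assumes the standard pretrained-pipeline interface; the exact key names in the returned dictionary depend on this pipeline's stages and are not taken from this card.

```python
# Sketch only: assumes `pipeline` was loaded via PretrainedPipeline as shown above.
result = pipeline.annotate(
    "Demonicus is a movie turned into a video game! "
    "Horror and sword fight freaks, buy this movie now!"
)
# `result` is a dict mapping output column names to lists of annotations;
# the predicted label is expected under a sentiment-related key.
print(result)
```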
## Results
```bash
|   | document | sentiment |
|---|----------|-----------|
| 0 | Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now! | positive |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|analyze_sentimentdl_glove_imdb|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.1+|
|Edition:|Official|
|Language:|en|
## Included Models
`glove_100d`, `sentimentdl_glove_imdb`
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from anurag0077)
author: John Snow Labs
name: distilbert_qa_base_uncased_finetuned_squad3
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad3` is an English model originally trained by `anurag0077`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad3_en_4.3.0_3.0_1672773665178.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad3_en_4.3.0_3.0_1672773665178.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad3","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_finetuned_squad3|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anurag0077/distilbert-base-uncased-finetuned-squad3
---
layout: model
title: Translate Hiligaynon to English Pipeline
author: John Snow Labs
name: translate_hil_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, hil, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `hil`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_hil_en_xx_2.7.0_2.4_1609686032449.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_hil_en_xx_2.7.0_2.4_1609686032449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_hil_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_hil_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.hil.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_hil_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForTokenClassification Cased model (from ml6team)
author: John Snow Labs
name: distilbert_token_classifier_keyphrase_extraction_openkp
date: 2023-03-03
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-openkp` is an English model originally trained by `ml6team`.
## Predicted Entities
`KEY`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_openkp_en_4.3.0_3.0_1677880905122.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_openkp_en_4.3.0_3.0_1677880905122.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_openkp","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_openkp","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
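The classifier assigns the single entity label `KEY` to tokens that belong to a keyphrase; consecutive `KEY` tokens are then merged into full phrases downstream. A minimal plain-Python sketch of that merging step (the helper name and the toy token/tag lists are illustrative, not part of the Spark NLP API):

```python
# Merge runs of consecutive tokens tagged "KEY" into keyphrases.
# The "KEY" label comes from this model card; everything else here
# is an illustrative sketch, not Spark NLP internals.

def merge_keyphrases(tokens, tags):
    """Group runs of consecutive KEY-tagged tokens into phrases."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "KEY":
            current.append(token)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

tokens = ["Spark", "NLP", "supports", "keyphrase", "extraction", "models"]
tags   = ["KEY", "KEY", "O", "KEY", "KEY", "O"]
print(merge_keyphrases(tokens, tags))  # ['Spark NLP', 'keyphrase extraction']
```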
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_keyphrase_extraction_openkp|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ml6team/keyphrase-extraction-distilbert-openkp
- https://github.com/microsoft/OpenKP
- https://arxiv.org/abs/1911.02671
- https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=openkp
---
layout: model
title: Word2Vec Embeddings in Bihari languages (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-14
tags: [cc, embeddings, fastText, word2vec, bh, open_source]
task: Embeddings
language: bh
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bh_3.4.1_3.0_1647286940542.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bh_3.4.1_3.0_1647286940542.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("bh.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
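Conceptually, a word-embeddings lookup annotator is a table mapping each known token to a fixed dense vector, with a fallback (typically zeros) for out-of-vocabulary tokens; since the card lists `Case sensitive: false`, tokens are lowercased before lookup. A toy sketch in plain Python (3-d vectors for brevity where the real model uses 300-d; the table values are made up):

```python
import math

# Toy lookup table: token -> dense vector. Illustrative values only;
# the real w2v_cc_300d model maps tokens to 300-dimensional vectors.
lookup = {
    "love":  [0.2, 0.8, 0.1],
    "spark": [0.9, 0.1, 0.3],
}

def embed(token, dim=3):
    # Case-insensitive lookup; unknown tokens fall back to a zero vector.
    return lookup.get(token.lower(), [0.0] * dim)

def cosine(a, b):
    # Cosine similarity, a common way to compare embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```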
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|bh|
|Size:|77.7 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Translate Tigrinya to English Pipeline
author: John Snow Labs
name: translate_ti_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, ti, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `ti`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ti_en_xx_2.7.0_2.4_1609689546761.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ti_en_xx_2.7.0_2.4_1609689546761.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_ti_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_ti_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.ti.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_ti_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Danish asr_alvenir_wav2vec2_base_nst_cv9 TFWav2Vec2ForCTC from chcaa
author: John Snow Labs
name: pipeline_asr_alvenir_wav2vec2_base_nst_cv9
date: 2022-09-25
tags: [wav2vec2, da, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: da
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_alvenir_wav2vec2_base_nst_cv9` is a Danish model originally trained by chcaa.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_alvenir_wav2vec2_base_nst_cv9_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_alvenir_wav2vec2_base_nst_cv9_da_4.2.0_3.0_1664104731248.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_alvenir_wav2vec2_base_nst_cv9_da_4.2.0_3.0_1664104731248.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_alvenir_wav2vec2_base_nst_cv9', lang = 'da')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_alvenir_wav2vec2_base_nst_cv9", lang = "da")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_alvenir_wav2vec2_base_nst_cv9|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|da|
|Size:|226.0 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
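The `Wav2Vec2ForCTC` stage turns per-frame character predictions into text with CTC greedy decoding: collapse repeated symbols, then drop the CTC blank. A minimal plain-Python sketch of that decoding step (the frame sequence is made up; this is illustrative, not the Spark NLP implementation):

```python
# CTC greedy decoding sketch: collapse consecutive repeats, remove blanks.
# "_" stands in for the CTC blank symbol here.
BLANK = "_"

def ctc_collapse(frames):
    out, prev = [], None
    for sym in frames:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank; repeats and blanks are dropped.
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

# "ll_ll" decodes to "ll": the blank separates two genuine l's.
print(ctc_collapse(list("hh_e_ll_llo")))  # hello
```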
---
layout: model
title: Translate Thai to English Pipeline
author: John Snow Labs
name: translate_th_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, th, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `th`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_th_en_xx_2.7.0_2.4_1609689519812.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_th_en_xx_2.7.0_2.4_1609689519812.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_th_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_th_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.th.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_th_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Pipeline to Detect Diseases in Medical Text
author: John Snow Labs
name: bert_token_classifier_ner_bc5cdr_disease_pipeline
date: 2023-03-20
tags: [en, ner, clinical, licensed, bertfortokenclassification]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [bert_token_classifier_ner_bc5cdr_disease](https://nlp.johnsnowlabs.com/2022/07/25/bert_token_classifier_ner_bc5cdr_disease_en_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc5cdr_disease_pipeline_en_4.3.0_3.2_1679302082722.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc5cdr_disease_pipeline_en_4.3.0_3.2_1679302082722.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_ner_bc5cdr_disease_pipeline", "en", "clinical/models")
text = '''Indomethacin resulted in histopathologic findings typical of interstitial cystitis, such as leaky bladder epithelium and mucosal mastocytosis. The true incidence of nonsteroidal anti-inflammatory drug-induced cystitis in humans must be clarified by prospective clinical trials. An open-label phase II study of low-dose thalidomide in androgen-independent prostate cancer.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_ner_bc5cdr_disease_pipeline", "en", "clinical/models")
val text = "Indomethacin resulted in histopathologic findings typical of interstitial cystitis, such as leaky bladder epithelium and mucosal mastocytosis. The true incidence of nonsteroidal anti-inflammatory drug-induced cystitis in humans must be clarified by prospective clinical trials. An open-label phase II study of low-dose thalidomide in androgen-independent prostate cancer."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:----------------------|--------:|------:|:------------|-------------:|
| 0 | interstitial cystitis | 61 | 81 | DISEASE | 0.999746 |
| 1 | mastocytosis | 129 | 140 | DISEASE | 0.999132 |
| 2 | cystitis | 209 | 216 | DISEASE | 0.999912 |
| 3 | prostate cancer | 355 | 369 | DISEASE | 0.999781 |
```
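The `begin`/`end` columns above are inclusive character offsets into the input text, so a chunk can be recovered in Python as `text[begin:end + 1]`. A quick check against the first row of the results:

```python
# Recover an NER chunk from its inclusive begin/end offsets, as reported
# in the results table above (begin=61, end=81 for the first row).
text = ("Indomethacin resulted in histopathologic findings typical of "
        "interstitial cystitis, such as leaky bladder epithelium and "
        "mucosal mastocytosis.")

begin, end = 61, 81  # offsets from the results table; end is inclusive
chunk = text[begin:end + 1]
print(chunk)  # interstitial cystitis
```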
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_bc5cdr_disease_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.7 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverterInternalModel
---
layout: model
title: English BertForQuestionAnswering model (from LoudlySoft)
author: John Snow Labs
name: bert_qa_scibert_scivocab_uncased_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `scibert_scivocab_uncased_squad` is an English model originally trained by `LoudlySoft`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_scibert_scivocab_uncased_squad_en_4.0.0_3.0_1654189441461.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_scibert_scivocab_uncased_squad_en_4.0.0_3.0_1654189441461.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_scibert_scivocab_uncased_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_scibert_scivocab_uncased_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.scibert.uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
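Extractive QA models like this one score every context token as a possible answer start and a possible answer end; the predicted answer is the best-scoring valid (start ≤ end) span. A plain-Python sketch of that span selection (the tokenized context and the logit values are made-up toy numbers, not model output):

```python
# Sketch of extractive QA span selection from start/end logits.
def best_span(start_logits, end_logits, max_len=15):
    """Return (start, end) of the highest-scoring span with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, sl in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = sl + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 5.0, 0.0, 0.1, 0.0, 0.1, 0.3, 0.0]
end_logits   = [0.0, 0.1, 0.2, 4.5, 0.1, 0.0, 0.1, 0.0, 0.2, 0.1]
s, e = best_span(start_logits, end_logits)
print(" ".join(context[s:e + 1]))  # Clara
```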
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_scibert_scivocab_uncased_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|410.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/LoudlySoft/scibert_scivocab_uncased_squad
---
layout: model
title: Financial Finetuned FLAN-T5 Text Generation ( Financial Alpaca )
author: John Snow Labs
name: fingen_flant5_finetuned_alpaca
date: 2023-05-25
tags: [en, finance, generation, licensed, flant5, alpaca, tensorflow]
task: Text Generation
language: en
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: FinanceTextGenerator
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `fingen_flant5_finetuned_alpaca` model is a text generation model fine-tuned from FLAN-T5 on the Financial Alpaca dataset. FLAN-T5 is an instruction-finetuned language model developed by Google that uses the T5 architecture for text-to-text generation tasks.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/fingen_flant5_finetuned_alpaca_en_1.0.0_3.0_1685016665729.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/fingen_flant5_finetuned_alpaca_en_1.0.0_3.0_1685016665729.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
flant5 = finance.TextGenerator.pretrained("fingen_flant5_finetuned_alpaca", "en", "finance/models")\
.setInputCols(["document"])\
.setOutputCol("generated")\
.setMaxNewTokens(256)\
.setStopAtEos(True)\
.setDoSample(True)\
.setTopK(3)
pipeline = nlp.Pipeline(stages=[document_assembler, flant5])
data = spark.createDataFrame([
[1, "What is the US Fair Tax?"]]).toDF('id', 'text')
results = pipeline.fit(data).transform(data)
results.select("generated.result").show(truncate=False)
```
## Results
```bash
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Fair tax in the US is essentially an income tax. Fair taxes are tax on your income, and are not taxeable in any country. Fair taxes are taxed as income. If you have a net gain or if the loss of income from taxable activities is less then the fair value (the loss) of your gross income (the loss) then you have to file an Income Report. This will give the US government an overview and give you an understanding. If your net income is less that your fair share of your gross income (which you are entitled) you have the right to claim a refund.]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
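The pipeline above enables sampling with `setDoSample(True)` and restricts it with `setTopK(3)`: at each generation step, only the 3 highest-probability tokens are kept, their probabilities renormalized, and the next token drawn from that reduced set. A plain-Python sketch of the filtering step (toy vocabulary and probabilities; illustrative, not the Spark NLP internals):

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def sample(probs, rng):
    """Draw one token from a normalized distribution."""
    r, acc = rng.random(), 0.0
    for tok, p in probs.items():
        acc += p
        if r <= acc:
            return tok
    return tok  # guard against floating-point rounding

probs = {"tax": 0.5, "income": 0.3, "the": 0.1, "zebra": 0.05, "qux": 0.05}
filtered = top_k_filter(probs, k=3)
print(sorted(filtered))  # ['income', 'tax', 'the']
token = sample(filtered, random.Random(0))
```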
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|fingen_flant5_finetuned_alpaca|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.6 GB|
## References
The dataset is available [here](https://huggingface.co/datasets/gbharti/finance-alpaca/viewer/gbharti--finance-alpaca)
---
layout: model
title: English Deberta Embeddings model (from ZZ99)
author: John Snow Labs
name: deberta_embeddings_tapt_nbme_v3_base
date: 2023-03-13
tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, en, tensorflow]
task: Embeddings
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DeBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tapt_nbme_deberta_v3_base` is an English model originally trained by `ZZ99`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_tapt_nbme_v3_base_en_4.3.1_3.0_1678712713960.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_tapt_nbme_v3_base_en_4.3.1_3.0_1678712713960.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_tapt_nbme_v3_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_tapt_nbme_v3_base","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|deberta_embeddings_tapt_nbme_v3_base|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|689.5 MB|
|Case sensitive:|false|
## References
https://huggingface.co/ZZ99/tapt_nbme_deberta_v3_base
---
layout: model
title: Detect Living Species (bert_base_cased)
author: John Snow Labs
name: ner_living_species_bert
date: 2022-06-23
tags: [ro, ner, clinical, licensed, bert]
task: Named Entity Recognition
language: ro
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Extract mentions of living species from Romanian clinical texts, a task critical to scientific disciplines such as medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `bert_base_cased` embeddings.
It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others.
**NOTE:**
1. The text files were translated from Spanish with a neural machine translation system.
2. The annotations were translated with the same neural machine translation system.
3. The translated annotations were transferred to the translated text files using an annotation transfer technology.
## Predicted Entities
`HUMAN`, `SPECIES`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_ro_3.5.3_3.0_1655974560466.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_ro_3.5.3_3.0_1655974560466.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_living_species_bert", "ro", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter
])
data = spark.createDataFrame([["""O femeie în vârstă de 26 de ani, însărcinată în 11 săptămâni, a consultat serviciul de urgențe dermatologice pentru că prezenta, de 4 zile, leziuni punctiforme dureroase de debut brusc pe vârful degetelor. Pacientul raportează că leziunile au început pe degete și ulterior s-au extins la degetele de la picioare. Markerii de imunitate, ANA și crioagglutininele, au fost negativi, iar serologia VHB a indicat doar vaccinarea. Pe baza acestor rezultate, diagnosticul de vasculită a fost exclus și, având în vedere diagnosticul suspectat de erupție cutanată cu mănuși și șosete, s-a efectuat serologia pentru virusul Ebstein Barr. Exantemă la mănuși și șosete datorat parvovirozei B19. Având în vedere suspiciunea unei afecțiuni infecțioase cu aceste caracteristici, a fost solicitată serologia pentru EBV, enterovirus și parvovirus B19, cu IgM pozitiv pentru acesta din urmă în două ocazii. De asemenea, nu au existat semne de anemie fetală sau complicații ale acesteia."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_living_species_bert", "ro", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter))
val data = Seq("""O femeie în vârstă de 26 de ani, însărcinată în 11 săptămâni, a consultat serviciul de urgențe dermatologice pentru că prezenta, de 4 zile, leziuni punctiforme dureroase de debut brusc pe vârful degetelor. Pacientul raportează că leziunile au început pe degete și ulterior s-au extins la degetele de la picioare. Markerii de imunitate, ANA și crioagglutininele, au fost negativi, iar serologia VHB a indicat doar vaccinarea. Pe baza acestor rezultate, diagnosticul de vasculită a fost exclus și, având în vedere diagnosticul suspectat de erupție cutanată cu mănuși și șosete, s-a efectuat serologia pentru virusul Ebstein Barr. Exantemă la mănuși și șosete datorat parvovirozei B19. Având în vedere suspiciunea unei afecțiuni infecțioase cu aceste caracteristici, a fost solicitată serologia pentru EBV, enterovirus și parvovirus B19, cu IgM pozitiv pentru acesta din urmă în două ocazii. De asemenea, nu au existat semne de anemie fetală sau complicații ale acesteia.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ro.med_ner.living_species.bert").predict("""O femeie în vârstă de 26 de ani, însărcinată în 11 săptămâni, a consultat serviciul de urgențe dermatologice pentru că prezenta, de 4 zile, leziuni punctiforme dureroase de debut brusc pe vârful degetelor. Pacientul raportează că leziunile au început pe degete și ulterior s-au extins la degetele de la picioare. Markerii de imunitate, ANA și crioagglutininele, au fost negativi, iar serologia VHB a indicat doar vaccinarea. Pe baza acestor rezultate, diagnosticul de vasculită a fost exclus și, având în vedere diagnosticul suspectat de erupție cutanată cu mănuși și șosete, s-a efectuat serologia pentru virusul Ebstein Barr. Exantemă la mănuși și șosete datorat parvovirozei B19. Având în vedere suspiciunea unei afecțiuni infecțioase cu aceste caracteristici, a fost solicitată serologia pentru EBV, enterovirus și parvovirus B19, cu IgM pozitiv pentru acesta din urmă în două ocazii. De asemenea, nu au existat semne de anemie fetală sau complicații ale acesteia.""")
```
## Results
```bash
+--------------------+-------+
|ner_chunk |label |
+--------------------+-------+
|femeie |HUMAN |
|Pacientul |HUMAN |
|VHB |SPECIES|
|virusul Ebstein Barr|SPECIES|
|parvovirozei B19 |SPECIES|
|EBV |SPECIES|
|enterovirus |SPECIES|
|parvovirus B19 |SPECIES|
|fetală |HUMAN |
+--------------------+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_living_species_bert|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ro|
|Size:|16.4 MB|
## References
[https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/)
## Benchmarking
```bash
label precision recall f1-score support
B-HUMAN 0.85 0.94 0.89 2184
B-SPECIES 0.75 0.85 0.80 2617
I-HUMAN 0.89 0.11 0.20 72
I-SPECIES 0.74 0.80 0.77 1027
micro-avg 0.79 0.86 0.82 5900
macro-avg 0.81 0.67 0.66 5900
weighted-avg 0.79 0.86 0.82 5900
```
---
layout: model
title: English DistilBertForTokenClassification Cased model (from ml6team)
author: John Snow Labs
name: distilbert_token_classifier_keyphrase_extraction_openkp
date: 2023-03-14
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-openkp` is an English model originally trained by `ml6team`.
## Predicted Entities
`KEY`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_openkp_en_4.3.1_3.0_1678782889694.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_openkp_en_4.3.1_3.0_1678782889694.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_openkp","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_openkp","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_keyphrase_extraction_openkp|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ml6team/keyphrase-extraction-distilbert-openkp
- https://github.com/microsoft/OpenKP
- https://arxiv.org/abs/1911.02671
- https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=openkp
---
layout: model
title: English image_classifier_vit_vision_transformer_fmri_classification_ft ViTForImageClassification from shivkumarganesh
author: John Snow Labs
name: image_classifier_vit_vision_transformer_fmri_classification_ft
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_vision_transformer_fmri_classification_ft` is an English model originally trained by shivkumarganesh.
## Predicted Entities
`test`, `train`, `val`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vision_transformer_fmri_classification_ft_en_4.1.0_3.0_1660166000402.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vision_transformer_fmri_classification_ft_en_4.1.0_3.0_1660166000402.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_vision_transformer_fmri_classification_ft", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_vision_transformer_fmri_classification_ft", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_vision_transformer_fmri_classification_ft|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Legal Special terms and conditions of trust Clause Binary Classifier
author: John Snow Labs
name: legclf_special_terms_and_conditions_of_trust_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `special-terms-and-conditions-of-trust` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it is better to skip it unless you want to do Binary Classification at sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
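The paragraph-splitting option above can be approximated in plain Python before feeding chunks to the classifier; a minimal sketch (the function name and regex are illustrative, not part of Legal NLP):

```python
import re

def split_paragraphs(text: str) -> list[str]:
    # Split on blank lines (one or more whitespace-only lines between paragraphs)
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = "Clause 1. Trust terms apply.\n\nClause 2. Other provisions."
print(split_paragraphs(doc))  # ['Clause 1. Trust terms apply.', 'Clause 2. Other provisions.']
```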
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you add.
## Predicted Entities
`other`, `special-terms-and-conditions-of-trust`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_special_terms_and_conditions_of_trust_clause_en_1.0.0_3.2_1660123013530.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_special_terms_and_conditions_of_trust_clause_en_1.0.0_3.2_1660123013530.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
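A usage sketch in the style of the other cards in this catalog. The sentence-embeddings stage shown here (`sent_bert_base_cased`) is an assumption, not confirmed by this card; check the Models Hub entry for the embeddings this classifier was trained with:

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed embeddings stage; must match the embeddings used at training time
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_special_terms_and_conditions_of_trust_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

data = spark.createDataFrame([["PUT YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```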
## Results
```bash
+---------------------------------------+
|result                                 |
+---------------------------------------+
|[special-terms-and-conditions-of-trust]|
|[other]                                |
|[other]                                |
|[special-terms-and-conditions-of-trust]|
+---------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_special_terms_and_conditions_of_trust_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 1.00 1.00 1.00 132
special-terms-and-conditions-of-trust 1.00 1.00 1.00 56
accuracy - - 1.00 188
macro-avg 1.00 1.00 1.00 188
weighted-avg 1.00 1.00 1.00 188
```
---
layout: model
title: Movies Sentiment Analysis
author: John Snow Labs
name: movies_sentiment_analysis
date: 2022-07-06
tags: [en, open_source]
task: Sentiment Analysis
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The movies_sentiment_analysis pipeline is a pretrained pipeline that performs basic text processing steps and predicts sentiment.
It covers most of the common text processing tasks on your dataframe.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/movies_sentiment_analysis_en_4.0.0_3.0_1657135804995.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/movies_sentiment_analysis_en_4.0.0_3.0_1657135804995.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("movies_sentiment_analysis", "en")
result = pipeline.annotate("""I love johnsnowlabs! """)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|movies_sentiment_analysis|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|210.0 MB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- SymmetricDeleteModel
- SentimentDetectorModel
---
layout: model
title: Word2Vec Embeddings in Galician (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-15
tags: [cc, embeddings, fastText, word2vec, gl, open_source]
task: Embeddings
language: gl
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_gl_3.4.1_3.0_1647374243984.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_gl_3.4.1_3.0_1647374243984.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","gl") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Eu amo a faísca NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","gl")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Eu amo a faísca NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("gl.embed.w2v_cc_300d").predict("""Eu amo a faísca NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|gl|
|Size:|779.2 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Multilingual XLMRoBerta Embeddings (from castorini)
author: John Snow Labs
name: xlmroberta_embeddings_afriberta_small
date: 2022-05-13
tags: [ha, yo, ig, am, so, open_source, xlm_roberta, embeddings, xx, afriberta]
task: Embeddings
language: xx
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: XlmRoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `afriberta_small` is a Multilingual model originally trained by `castorini`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_afriberta_small_xx_3.4.4_3.0_1652439280261.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_afriberta_small_xx_3.4.4_3.0_1652439280261.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_afriberta_small","xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_afriberta_small","xx")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_embeddings_afriberta_small|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|xx|
|Size:|311.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/castorini/afriberta_small
- https://github.com/keleog/afriberta
---
layout: model
title: Smaller BERT Sentence Embeddings (L-6_H-256_A-4)
author: John Snow Labs
name: sent_small_bert_L6_256
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L6_256_en_2.6.0_2.4_1598350409969.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L6_256_en_2.6.0_2.4_1598350409969.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L6_256", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L6_256", "en")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.small_bert_L6_256').predict(text, output_level='sentence')
embeddings_df
```
{:.h2_title}
## Results
```bash
en_embed_sentence_small_bert_L6_256_embeddings sentence
[0.7711525559425354, 0.5496315956115723, 1.261... I hate cancer
[0.28574034571647644, -0.03116176463663578, 1.... Antibiotics aren't painkiller
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_small_bert_L6_256|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|[en]|
|Dimension:|256|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1
---
layout: model
title: Legal Injunctive relief Clause Binary Classifier
author: John Snow Labs
name: legclf_injunctive_relief_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `injunctive-relief` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it is better to skip it unless you want to do Binary Classification at sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
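The header-based splitting mentioned above can be sketched with a regular expression over numbered section headings (a plain-Python illustration; the pattern is hypothetical and should be adapted to your documents):

```python
import re

def split_by_headers(text: str) -> list[str]:
    # Split immediately before lines that look like numbered headers,
    # e.g. "12. INJUNCTIVE RELIEF" (zero-width lookahead keeps the header text)
    sections = re.split(r"(?m)^(?=\d+\.\s+[A-Z])", text)
    return [s.strip() for s in sections if s.strip()]

contract = "Preamble text.\n1. DEFINITIONS\nTerms...\n2. INJUNCTIVE RELIEF\nThe parties agree..."
print(len(split_by_headers(contract)))  # 3 sections: preamble plus two clauses
```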
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you add.
## Predicted Entities
`other`, `injunctive-relief`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_injunctive_relief_clause_en_1.0.0_3.2_1660122542368.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_injunctive_relief_clause_en_1.0.0_3.2_1660122542368.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
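A usage sketch in the style of the other cards in this catalog. The sentence-embeddings stage shown here (`sent_bert_base_cased`) is an assumption, not confirmed by this card; check the Models Hub entry for the embeddings this classifier was trained with:

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed embeddings stage; must match the embeddings used at training time
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_injunctive_relief_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

data = spark.createDataFrame([["PUT YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```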
## Results
```bash
+-------------------+
|result             |
+-------------------+
|[injunctive-relief]|
|[other]            |
|[other]            |
|[injunctive-relief]|
+-------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_injunctive_relief_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
injunctive-relief 0.91 1.00 0.95 30
other 1.00 0.97 0.99 103
accuracy - - 0.98 133
macro-avg 0.95 0.99 0.97 133
weighted-avg 0.98 0.98 0.98 133
```
---
layout: model
title: Translate English to Hiri Motu Pipeline
author: John Snow Labs
name: translate_en_ho
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, ho, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `ho`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ho_xx_2.7.0_2.4_1609691430837.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ho_xx_2.7.0_2.4_1609691430837.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_ho", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_ho", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.ho').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_ho|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Full disclosure Clause Binary Classifier
author: John Snow Labs
name: legclf_full_disclosure_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `full-disclosure` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it is better to skip it unless you want to do Binary Classification at sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you add.
## Predicted Entities
`other`, `full-disclosure`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_full_disclosure_clause_en_1.0.0_3.2_1660122467527.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_full_disclosure_clause_en_1.0.0_3.2_1660122467527.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
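A usage sketch in the style of the other cards in this catalog. The sentence-embeddings stage shown here (`sent_bert_base_cased`) is an assumption, not confirmed by this card; check the Models Hub entry for the embeddings this classifier was trained with:

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed embeddings stage; must match the embeddings used at training time
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_full_disclosure_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

data = spark.createDataFrame([["PUT YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```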
## Results
```bash
+-----------------+
|result           |
+-----------------+
|[full-disclosure]|
|[other]          |
|[other]          |
|[full-disclosure]|
+-----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_full_disclosure_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
full-disclosure 1.00 0.94 0.97 31
other 0.98 1.00 0.99 104
accuracy - - 0.99 135
macro-avg 0.99 0.97 0.98 135
weighted-avg 0.99 0.99 0.99 135
```
---
layout: model
title: Pipeline to Summarize Clinical Question Notes
author: John Snow Labs
name: summarizer_clinical_questions_pipeline
date: 2023-05-29
tags: [licensed, en, clinical, summarization, question]
task: Summarization
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [summarizer_clinical_questions](https://nlp.johnsnowlabs.com/2023/04/03/summarizer_clinical_questions_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_questions_pipeline_en_4.4.2_3.0_1685401048463.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_questions_pipeline_en_4.4.2_3.0_1685401048463.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("summarizer_clinical_questions_pipeline", "en", "clinical/models")
text = """
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
"""
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("summarizer_clinical_questions_pipeline", "en", "clinical/models")
val text = """
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
"""
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
What are the treatments for hyperthyroidism?
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|summarizer_clinical_questions_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|936.7 MB|
## Included Models
- DocumentAssembler
- MedicalSummarizer
---
layout: model
title: Chinese BertForQuestionAnswering model (from uer)
author: John Snow Labs
name: bert_qa_roberta_base_chinese_extractive_qa
date: 2022-06-02
tags: [zh, open_source, question_answering, bert]
task: Question Answering
language: zh
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-chinese-extractive-qa` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_roberta_base_chinese_extractive_qa_zh_4.0.0_3.0_1654189258198.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_roberta_base_chinese_extractive_qa_zh_4.0.0_3.0_1654189258198.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_roberta_base_chinese_extractive_qa","zh") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_roberta_base_chinese_extractive_qa","zh")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.answer_question.bert.base.by_uer").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_roberta_base_chinese_extractive_qa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|zh|
|Size:|381.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/uer/roberta-base-chinese-extractive-qa
- https://spaces.ac.cn/archives/4338
- https://www.kesci.com/home/competition/5d142d8cbb14e6002c04e14a/content/0
- https://github.com/dbiir/UER-py/
- https://cloud.tencent.com/product/tione/
- https://github.com/ymcui/cmrc2018
---
layout: model
title: English RobertaForQuestionAnswering (from nlpconnect)
author: John Snow Labs
name: roberta_qa_roberta_base_squad2_nq
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-nq` is an English model originally trained by `nlpconnect`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_nq_en_4.0.0_3.0_1655735618263.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_nq_en_4.0.0_3.0_1655735618263.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad2_nq","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_squad2_nq","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.base.by_nlpconnect").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_squad2_nq|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/nlpconnect/roberta-base-squad2-nq
---
layout: model
title: German T5ForConditionalGeneration Base Cased model (from Einmalumdiewelt)
author: John Snow Labs
name: t5_base_gnad_maxsamples
date: 2023-01-30
tags: [de, open_source, t5, tensorflow]
task: Text Generation
language: de
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `T5-Base_GNAD_MaxSamples` is a German model originally trained by `Einmalumdiewelt`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_gnad_maxsamples_de_4.3.0_3.0_1675099257674.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_gnad_maxsamples_de_4.3.0_3.0_1675099257674.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_base_gnad_maxsamples","de") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_base_gnad_maxsamples","de")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_base_gnad_maxsamples|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|de|
|Size:|922.8 MB|
## References
- https://huggingface.co/Einmalumdiewelt/T5-Base_GNAD_MaxSamples
---
layout: model
title: Context Spell Checker for the English Language
author: John Snow Labs
name: spellcheck_dl
date: 2021-03-28
tags: [en, open_source]
supported: true
task: Spell Check
language: en
nav_key: models
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: ContextSpellCheckerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Spell Checker is a sequence-to-sequence model that detects and corrects spelling errors in your input text. It’s based on Levenshtein Automaton for generating candidate corrections and a Neural Language Model for ranking corrections.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spellcheck_dl_en_3.0.0_3.0_1616900699393.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/spellcheck_dl_en_3.0.0_3.0_1616900699393.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The model works at the token level, so it must be placed after tokenization. The model can change the length of tokens when correcting words, so keep this in mind when using it before other annotators that work with absolute references to the original document, such as NerConverter.
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|spellcheck_dl|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[corrected]|
|Language:|en|
## Data Source
American National Corpus.
---
layout: model
title: English image_classifier_vit_modeversion28_7 ViTForImageClassification from sudo-s
author: John Snow Labs
name: image_classifier_vit_modeversion28_7
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_modeversion28_7` is an English model originally trained by sudo-s.
## Predicted Entities
`45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_modeversion28_7_en_4.1.0_3.0_1660168471234.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_modeversion28_7_en_4.1.0_3.0_1660168471234.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_modeversion28_7", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_modeversion28_7", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_modeversion28_7|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|322.3 MB|
---
layout: model
title: BioBERT Embeddings (Discharge)
author: John Snow Labs
name: biobert_discharge_base_cased
date: 2020-09-19
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.2
spark_version: 2.4
tags: [embeddings, en, open_source]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model contains pre-trained weights of ClinicalBERT for discharge summaries. This domain-specific model improves performance on 3 of 5 clinical NLP tasks and establishes a new state-of-the-art on the MedNLI dataset. The details are described in the paper "[Publicly Available Clinical BERT Embeddings](https://www.aclweb.org/anthology/W19-1909/)".
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_discharge_base_cased_en_2.6.2_2.4_1600531401858.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_discharge_base_cased_en_2.6.2_2.4_1600531401858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("biobert_discharge_base_cased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("biobert_discharge_base_cased", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I hate cancer").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer"]
embeddings_df = nlu.load('en.embed.biobert.discharge_base_cased').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_biobert_discharge_base_cased_embeddings
I [0.0036486536264419556, 0.3796533942222595, -0...
hate [0.1914958357810974, 0.6709488034248352, -0.49...
cancer [0.04618441313505173, -0.04562612622976303, -0...
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|biobert_discharge_base_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.2|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|en|
|Dimension:|768|
|Case sensitive:|true|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/EmilyAlsentzer/clinicalBERT](https://github.com/EmilyAlsentzer/clinicalBERT)
---
layout: model
title: Dispute Clause Binary Classifier
author: John Snow Labs
name: legclf_dispute_clauses_cuad
date: 2023-01-18
tags: [en, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `dispute_clause` clause type. To use this model, make sure you provide enough context as input.
Sentences have been used as positive examples, so better results will be achieved if a SentenceDetector is added to the pipeline.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other 300+ Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
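As a plain-Python illustration of the splitting idea (a naive whitespace-token chunker, not the tutorial's technique), a long document can be cut into pieces that fit the 512-token limit before classification:

```python
def split_into_chunks(text: str, max_tokens: int = 512) -> list:
    """Naively split text into whitespace-token chunks of at most max_tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

# A 1100-word document yields chunks of 512, 512 and 76 tokens.
chunks = split_into_chunks("lorem " * 1100)
```

Each chunk can then be fed to the classifier separately; in practice a sentence- or section-aware splitter will preserve clause boundaries better than this word-count cut.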
## Predicted Entities
`dispute_clause`, `other`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_dispute_clauses_cuad_en_1.0.0_3.0_1674056674986.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_dispute_clauses_cuad_en_1.0.0_3.0_1674056674986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
docClassifier = legal.ClassifierDLModel() \
.pretrained("legclf_dispute_clauses_cuad","en","legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("is_dispute_clause")
pipeline = nlp.Pipeline() \
.setStages(
[
documentAssembler,
embeddings,
docClassifier
]
)
fit_model = pipeline.fit(spark.createDataFrame([[""]]).toDF('text'))
lm = nlp.LightPipeline(fit_model)
pos_example = "24.2 The parties irrevocably agree that the courts of Ohio shall have non-exclusive jurisdiction to settle any dispute or claim that arises out of or in connection with this agreement or its subject matter or formation ( including non - contractual disputes or claims )."
neg_example = "Brokers’ Fees and Expenses Except as expressly set forth in the Transaction Documents to the contrary, each party shall pay the fees and expenses of its advisers, counsel, accountants and other experts, if any, and all other expenses incurred by such party incident to the negotiation, preparation, execution, delivery and performance of this Agreement. The Company shall pay all transfer agent fees, stamp taxes and other taxes and duties levied in connection with the delivery of any Warrant Shares to the Purchasers. Steel Pier Capital Advisors, LLC shall be reimbursed its expenses in having the Transaction Documents prepared on behalf of the Company and for its obligations under the Security Agreement in an amount not to exceed $25,000.00."
texts = [
pos_example,
neg_example
]
res = lm.annotate(texts)
```
## Results
```bash
['dispute_clause']
['other']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_dispute_clauses_cuad|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[label]|
|Language:|en|
|Size:|22.9 MB|
## References
Manual annotations of CUAD dataset
## Benchmarking
```bash
label precision recall f1-score support
dispute_clause 1.00 1.00 1.00 61
other 1.00 1.00 1.00 96
accuracy - - 1.00 157
macro-avg 1.00 1.00 1.00 157
weighted-avg 1.00 1.00 1.00 157
```
---
layout: model
title: Pipeline to Detect Problems, Tests and Treatments
author: John Snow Labs
name: ner_healthcare_pipeline
date: 2023-03-14
tags: [ner, licensed, clinical, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_healthcare](https://nlp.johnsnowlabs.com/2021/04/21/ner_healthcare_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_pipeline_en_4.3.0_3.2_1678824932575.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_pipeline_en_4.3.0_3.2_1678824932575.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_healthcare_pipeline", "en", "clinical/models")
text = '''A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG .'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_healthcare_pipeline", "en", "clinical/models")
val text = "A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG ."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.healthcare_pipeline").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG .""")
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:------------------------------|--------:|------:|:------------|-------------:|
| 0 | gestational diabetes mellitus | 39 | 67 | PROBLEM | 0.938233 |
| 1 | type two diabetes mellitus | 128 | 153 | PROBLEM | 0.762925 |
| 2 | HTG-induced pancreatitis | 186 | 209 | PROBLEM | 0.9742 |
| 3 | an acute hepatitis | 263 | 280 | PROBLEM | 0.915067 |
| 4 | obesity | 288 | 294 | PROBLEM | 0.9926 |
| 5 | a body mass index | 301 | 317 | TEST | 0.721175 |
| 6 | BMI | 321 | 323 | TEST | 0.4466 |
| 7 | polyuria | 380 | 387 | PROBLEM | 0.9987 |
| 8 | polydipsia | 391 | 400 | PROBLEM | 0.9993 |
| 9 | poor appetite | 404 | 416 | PROBLEM | 0.96315 |
| 10 | vomiting | 424 | 431 | PROBLEM | 0.9588 |
| 11 | amoxicillin | 511 | 521 | TREATMENT | 0.6453 |
| 12 | a respiratory tract infection | 527 | 555 | PROBLEM | 0.867 |
| 13 | metformin | 570 | 578 | TREATMENT | 0.9989 |
| 14 | glipizide | 582 | 590 | TREATMENT | 0.9997 |
| 15 | dapagliflozin | 598 | 610 | TREATMENT | 0.9996 |
| 16 | T2DM | 616 | 619 | TREATMENT | 0.9662 |
| 17 | atorvastatin | 625 | 636 | TREATMENT | 0.9993 |
| 18 | gemfibrozil | 642 | 652 | TREATMENT | 0.9997 |
| 19 | HTG | 658 | 660 | PROBLEM | 0.9927 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_healthcare_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|513.6 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: English ElectraForQuestionAnswering model (from mrm8488) Version-2
author: John Snow Labs
name: electra_qa_base_finetuned_squadv2
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-finetuned-squadv2` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_finetuned_squadv2_en_4.0.0_3.0_1655920687688.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_finetuned_squadv2_en_4.0.0_3.0_1655920687688.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_finetuned_squadv2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_finetuned_squadv2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.electra.base_v2").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_base_finetuned_squadv2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|408.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/mrm8488/electra-base-finetuned-squadv2
---
layout: model
title: Sentence Entity Resolver for UMLS CUI Codes
author: John Snow Labs
name: sbiobertresolve_umls_major_concepts
date: 2021-05-02
tags: [en, clinical, licensed, entity_resolution]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.2
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Map clinical entities to UMLS CUI codes.
## Predicted Entities
This model returns CUI (concept unique identifier) codes for `Clinical Findings`, `Medical Devices`, `Anatomical Structures` and `Injuries & Poisoning` terms.
{:.btn-box}
[Live Demo](http://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_major_concepts_en_3.0.2_3.0_1619973285528.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_major_concepts_en_3.0.2_3.0_1619973285528.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The ```sbiobertresolve_umls_major_concepts``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_jsl``` as the NER model, with ```Cerebrovascular_Disease, Communicable_Disease, Diabetes, Disease_Syndrome_Disorder, Heart_Disease, Hyperlipidemia, Hypertension, Injury_or_Poisoning, Kidney_Disease, Medical-Device, Obesity, Oncological, Overweight, Psychological_Condition, Symptom, VS_Finding, ImagingFindings, EKG_Findings``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_umls_major_concepts", "en", "clinical/models") \
.setInputCols(["ner_chunk", "sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
pipeline = Pipeline(stages = [document_assembler, sentence_detector, tokens, embeddings, ner, ner_converter, chunk2doc, sbert_embedder, resolver])
data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""]]).toDF("text")
results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.umls").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_lucky_model", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_lucky_model", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_lucky_model|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|324.8 MB|
---
layout: model
title: English image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_apartment_office ViTForImageClassification from mayoughi
author: John Snow Labs
name: image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_apartment_office
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_apartment_office` is an English model originally trained by mayoughi.
## Predicted Entities
`office`, `balcony`, `restaurant`, `hospital`, `inside apartment`, `airport`, `hallway`, `inside coffee house`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_apartment_office_en_4.1.0_3.0_1660172866138.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_apartment_office_en_4.1.0_3.0_1660172866138.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_apartment_office", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_apartment_office", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_apartment_office|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Google's Tapas Table Understanding (Medium, WTQ)
author: John Snow Labs
name: table_qa_tapas_medium_finetuned_wtq
date: 2022-09-30
tags: [en, table, qa, question, answering, open_source]
task: Table Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: TapasForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a zero-shot Table Understanding model which allows you to carry out Question Answering on Spark DataFrames. If your table is stored in a file format such as CSV, load it with Spark before using this model.
Size of this model: Medium
Has aggregation operations?: True
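The model consumes tables as JSON objects with `header` and `rows` keys, as in the example below. If your data starts out as CSV, a stdlib-only sketch of the conversion (the CSV content here is illustrative):

```python
import csv
import io
import json

# Hypothetical CSV content; in practice you would read this from a file.
csv_text = "name,money,age\nDonald Trump,$100000000,75\nElon Musk,$20000000000000,55"

reader = csv.reader(io.StringIO(csv_text))
rows = list(reader)

# The TableAssembler expects a JSON object with "header" and "rows" keys:
# the first CSV row becomes the header, the rest become the data rows.
table_json = json.dumps({"header": rows[0], "rows": rows[1:]})
```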
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_medium_finetuned_wtq_en_4.2.0_3.0_1664530490771.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_medium_finetuned_wtq_en_4.2.0_3.0_1664530490771.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
json_data = """
{
"header": ["name", "money", "age"],
"rows": [
["Donald Trump", "$100,000,000", "75"],
["Elon Musk", "$20,000,000,000,000", "55"]
]
}
"""
queries = [
"Who earns less than 200,000,000?",
"Who earns 100,000,000?",
"How much money has Donald Trump?",
"How old are they?",
]
data = spark.createDataFrame([
[json_data, " ".join(queries)]
]).toDF("table_json", "questions")
document_assembler = MultiDocumentAssembler() \
.setInputCols("table_json", "questions") \
.setOutputCols("document_table", "document_questions")
sentence_detector = SentenceDetector() \
.setInputCols(["document_questions"]) \
.setOutputCol("questions")
table_assembler = TableAssembler()\
.setInputCols(["document_table"])\
.setOutputCol("table")
tapas = TapasForQuestionAnswering\
.pretrained("table_qa_tapas_medium_finetuned_wtq","en")\
.setInputCols(["questions", "table"])\
.setOutputCol("answers")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
table_assembler,
tapas
])
model = pipeline.fit(data)
model\
.transform(data)\
.selectExpr("explode(answers) AS answer")\
.select("answer")\
.show(truncate=False)
```
## Results
```bash
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|answer |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} |
|{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} |
|{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} |
|{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
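Note the last row: for "How old are they?" the model selects both age cells and applies an `AVERAGE` aggregation instead of returning a single span. Conceptually, with the values from the example table:

```python
# The AVERAGE aggregation averages the values of the selected cells.
selected_cells = [75, 55]  # the two age cells picked for "How old are they?"
answer = sum(selected_cells) / len(selected_cells)  # 65.0
```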
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|table_qa_tapas_medium_finetuned_wtq|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|157.5 MB|
|Case sensitive:|false|
## References
https://www.microsoft.com/en-us/download/details.aspx?id=54253
https://github.com/ppasupat/WikiTableQuestions
---
layout: model
title: Arabic ElectraForQuestionAnswering model (from salti)
author: John Snow Labs
name: electra_qa_AraElectra_base_finetuned_ARCD
date: 2022-06-22
tags: [ar, open_source, electra, question_answering]
task: Question Answering
language: ar
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `AraElectra-base-finetuned-ARCD` is an Arabic model originally trained by `salti`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_AraElectra_base_finetuned_ARCD_ar_4.0.0_3.0_1655918851105.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_AraElectra_base_finetuned_ARCD_ar_4.0.0_3.0_1655918851105.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_AraElectra_base_finetuned_ARCD","ar") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_AraElectra_base_finetuned_ARCD","ar")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.answer_question.squad_arcd.electra.base").predict("""ما هو اسمي؟|||"اسمي كلارا وأنا أعيش في بيركلي.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_AraElectra_base_finetuned_ARCD|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ar|
|Size:|504.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/salti/AraElectra-base-finetuned-ARCD
---
layout: model
title: Legal Building And Public Works Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_building_and_public_works_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, building_and_public_works, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the `legclf_building_and_public_works_bert` model, a Bert Sentence Embeddings Document Classifier, determines whether the document belongs to the `Building_and_Public_Works` class or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`Building_and_Public_Works`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_building_and_public_works_bert_en_1.0.0_3.0_1678111597578.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_building_and_public_works_bert_en_1.0.0_3.0_1678111597578.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+---------------------------+
|result                     |
+---------------------------+
|[Building_and_Public_Works]|
|[Other]                    |
|[Other]                    |
|[Building_and_Public_Works]|
+---------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_building_and_public_works_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.8 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Building_and_Public_Works 0.85 0.85 0.85 33
Other 0.87 0.87 0.87 39
accuracy - - 0.86 72
macro-avg 0.86 0.86 0.86 72
weighted-avg 0.86 0.86 0.86 72
```
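The macro and weighted averages in the table follow directly from the per-class scores; a quick check in plain Python:

```python
# Per-class F1 and support from the benchmark table above.
f1 = {"Building_and_Public_Works": 0.85, "Other": 0.87}
support = {"Building_and_Public_Works": 33, "Other": 39}
total = sum(support.values())  # 72

# Macro average: unweighted mean of the per-class scores.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: per-class scores weighted by class support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total
```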
---
layout: model
title: English T5ForConditionalGeneration Cased model (from ThomasNLG)
author: John Snow Labs
name: t5_weighter_cnndm
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-weighter_cnndm-en` is an English model originally trained by `ThomasNLG`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_weighter_cnndm_en_4.3.0_3.0_1675156764080.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_weighter_cnndm_en_4.3.0_3.0_1675156764080.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_weighter_cnndm","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_weighter_cnndm","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_weighter_cnndm|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|277.8 MB|
## References
- https://huggingface.co/ThomasNLG/t5-weighter_cnndm-en
- https://github.com/ThomasScialom/QuestEval
- https://arxiv.org/abs/2103.12693
---
layout: model
title: French CamemBert Embeddings (from joe8zhang)
author: John Snow Labs
name: camembert_embeddings_joe8zhang_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `joe8zhang`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_joe8zhang_generic_model_fr_3.4.4_3.0_1653988899631.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_joe8zhang_generic_model_fr_3.4.4_3.0_1653988899631.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_joe8zhang_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_joe8zhang_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_joe8zhang_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/joe8zhang/dummy-model
---
layout: model
title: Korean ElectraForQuestionAnswering Small model (from monologg) Version-3
author: John Snow Labs
name: electra_qa_small_v3_finetuned_korquad
date: 2022-06-22
tags: [ko, open_source, electra, question_answering]
task: Question Answering
language: ko
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `koelectra-small-v3-finetuned-korquad` is a Korean model originally trained by `monologg`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_small_v3_finetuned_korquad_ko_4.0.0_3.0_1655922310458.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_small_v3_finetuned_korquad_ko_4.0.0_3.0_1655922310458.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_small_v3_finetuned_korquad","ko") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_small_v3_finetuned_korquad","ko")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ko.answer_question.korquad.electra.small").predict("""내 이름은 무엇입니까?|||"제 이름은 클라라이고 저는 버클리에 살고 있습니다.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_small_v3_finetuned_korquad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ko|
|Size:|53.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/monologg/koelectra-small-v3-finetuned-korquad
---
layout: model
title: Detect Drugs and Posology Entities (ner_posology_greedy)
author: John Snow Labs
name: ner_posology_greedy
date: 2020-12-08
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 2.6.5
spark_version: 2.4
tags: [ner, licensed, clinical, en]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model detects drugs, dosage, form, frequency, duration, route, and drug strength in text. It differs from `ner_posology` in the sense that it chunks together drugs, dosage, form, strength, and route when they appear together, resulting in a bigger chunk. It is trained using `embeddings_clinical` so please use the same embeddings in the pipeline.
## Predicted Entities
`DRUG`, `STRENGTH`, `DURATION`, `FREQUENCY`, `FORM`, `DOSAGE`, `ROUTE`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange.button-orange-trans.co.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_POSOLOGY.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_greedy_en_2.6.4_2.4_1607422064676.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_greedy_en_2.6.4_2.4_1607422064676.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_posology_greedy", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day."]]).toDF("text"))
```
```scala
...
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = NerDLModel.pretrained("ner_posology_greedy", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter))
val data = Seq("The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.posology.greedy").predict("""The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_risk_factors_pipeline", "en", "clinical/models")
text = '''HISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed.
REVIEW OF SYSTEMS: All other systems reviewed & are negative.
PAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC.
SOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. Works as a banker.
FAMILY HISTORY: Positive for coronary artery disease (father & brother).'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_risk_factors_pipeline", "en", "clinical/models")
val text = """HISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed.
REVIEW OF SYSTEMS: All other systems reviewed & are negative.
PAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC.
SOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. Works as a banker.
FAMILY HISTORY: Positive for coronary artery disease (father & brother)."""
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.risk_factors.pipeline").predict("""HISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed.
REVIEW OF SYSTEMS: All other systems reviewed & are negative.
PAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC.
SOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. Works as a banker.
FAMILY HISTORY: Positive for coronary artery disease (father & brother).""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:-------------------------------------|--------:|------:|:-------------|-------------:|
| 0 | diabetic | 136 | 143 | DIABETES | 0.9992 |
| 1 | coronary artery disease | 172 | 194 | CAD | 0.689667 |
| 2 | Diabetes mellitus type II | 1315 | 1339 | DIABETES | 0.73075 |
| 3 | hypertension | 1342 | 1353 | HYPERTENSION | 0.986 |
| 4 | coronary artery disease | 1356 | 1378 | CAD | 0.882567 |
| 5 | 1995 | 1422 | 1425 | PHI | 0.9999 |
| 6 | ABC | 1434 | 1436 | PHI | 0.9999 |
| 7 | Smokes 2 packs of cigarettes per day | 1481 | 1516 | SMOKER | 0.634257 |
| 8 | banker | 1530 | 1535 | PHI | 0.9779 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_risk_factors_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Amharic Named Entity Recognition (from mbeukman)
author: John Snow Labs
name: xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_amharic
date: 2022-05-17
tags: [xlm_roberta, ner, token_classification, am, open_source]
task: Named Entity Recognition
language: am
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-swahili-finetuned-ner-amharic` is an Amharic model originally trained by `mbeukman`.
## Predicted Entities
`PER`, `ORG`, `LOC`, `DATE`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_amharic_am_3.4.2_3.0_1652810385568.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_amharic_am_3.4.2_3.0_1652810385568.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_amharic","am") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["ስካርቻ nlp እወዳለሁ"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_amharic","am")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("ስካርቻ nlp እወዳለሁ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_amharic|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|am|
|Size:|1.0 GB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-swahili-finetuned-ner-amharic
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://www.apache.org/licenses/LICENSE-2.0
---
layout: model
title: Word2Vec Embeddings in Sicilian (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, scn, open_source]
task: Embeddings
language: scn
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_scn_3.4.1_3.0_1647457121351.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_scn_3.4.1_3.0_1647457121351.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","scn") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","scn")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("scn.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|scn|
|Size:|138.5 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Google's Tapas Table Understanding (Tiny, SQA)
author: John Snow Labs
name: table_qa_tapas_tiny_finetuned_sqa
date: 2022-09-30
tags: [en, table, qa, question, answering, open_source]
task: Table Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: TapasForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a zero-shot table understanding model that lets you carry out question answering over Spark DataFrames. If your table is stored in a file format such as CSV, load it into a DataFrame with Spark before using this model.
Size of this model: Tiny
Has aggregation operations?: False
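The annotator consumes tables as a JSON payload with a `header` list and a `rows` list of lists, as shown in the usage example below. As a minimal sketch, a CSV file could be converted to that payload with the Python standard library before building the Spark DataFrame:

```python
import csv
import io
import json

def csv_to_table_json(csv_text: str) -> str:
    """Convert CSV text into the {"header": [...], "rows": [[...], ...]}
    JSON payload consumed for table question answering."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return json.dumps({"header": rows[0], "rows": rows[1:]})

csv_text = "name,money,age\nDonald Trump,$100000000,75\nElon Musk,$20000000000000,55"
print(csv_to_table_json(csv_text))
```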
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_tiny_finetuned_sqa_en_4.2.0_3.0_1664530438363.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_tiny_finetuned_sqa_en_4.2.0_3.0_1664530438363.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
json_data = """
{
"header": ["name", "money", "age"],
"rows": [
["Donald Trump", "$100,000,000", "75"],
["Elon Musk", "$20,000,000,000,000", "55"]
]
}
"""
queries = [
"Who earns less than 200,000,000?",
"Who earns 100,000,000?",
"How much money has Donald Trump?",
"How old are they?",
]
data = spark.createDataFrame([
[json_data, " ".join(queries)]
]).toDF("table_json", "questions")
document_assembler = MultiDocumentAssembler() \
.setInputCols("table_json", "questions") \
.setOutputCols("document_table", "document_questions")
sentence_detector = SentenceDetector() \
.setInputCols(["document_questions"]) \
.setOutputCol("questions")
table_assembler = TableAssembler()\
.setInputCols(["document_table"])\
.setOutputCol("table")
tapas = TapasForQuestionAnswering\
.pretrained("table_qa_tapas_tiny_finetuned_sqa","en")\
.setInputCols(["questions", "table"])\
.setOutputCol("answers")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
table_assembler,
tapas
])
model = pipeline.fit(data)
model\
.transform(data)\
.selectExpr("explode(answers) AS answer")\
.select("answer")\
.show(truncate=False)
```
## Results
```bash
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|answer |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} |
|{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} |
|{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} |
|{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|table_qa_tapas_tiny_finetuned_sqa|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|17.4 MB|
|Case sensitive:|false|
## References
https://www.microsoft.com/en-us/download/details.aspx?id=54253
---
layout: model
title: English asr_wav2vec2_large_960h_lv60_self_4_gram TFWav2Vec2ForCTC from patrickvonplaten
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_960h_lv60_self_4_gram
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h_lv60_self_4_gram` is an English model originally trained by patrickvonplaten.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_960h_lv60_self_4_gram_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_960h_lv60_self_4_gram_en_4.2.0_3.0_1664021741793.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_960h_lv60_self_4_gram_en_4.2.0_3.0_1664021741793.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_960h_lv60_self_4_gram', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_960h_lv60_self_4_gram", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_960h_lv60_self_4_gram|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|757.4 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Catalan RoBERTa embeddings
author: cayorodriguez
name: roberta_embeddings_bsc
date: 2022-07-07
tags: [roberta, projecte_aina, ca, open_source]
task: Embeddings
language: ca
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: false
recommended: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Catalan RoBERTa word embeddings, used within the `PlanTL-GOB-ES/roberta-base-ca` project. This model requires a specific tokenizer, as shown in the Python example section.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/community.johnsnowlabs.com/cayorodriguez/roberta_embeddings_bsc_ca_3.4.4_3.0_1657198648319.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://community.johnsnowlabs.com/cayorodriguez/roberta_embeddings_bsc_ca_3.4.4_3.0_1657198648319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
ex_list = ["aprox\.","pàg\.","p\.ex\.","gen\.","feb\.","abr\.","jul\.","set\.","oct\.","nov\.","dec\.","dr\.","dra\.","sr\.","sra\.","srta\.","núm\.","st\.","sta\.","pl\.","etc\.", "ex\."]
ex_list_all = []
ex_list_all.extend(ex_list)
ex_list_all.extend([x[0].upper() + x[1:] for x in ex_list])
ex_list_all.extend([x.upper() for x in ex_list])
tokenizer = Tokenizer() \
.setInputCols(['document']).setOutputCol('token')\
.setInfixPatterns(["(d|D)(els)","(d|D)(el)","(a|A)(ls)","(a|A)(l)","(p|P)(els)","(p|P)(el)",\
"([A-zÀ-ú_@]+)(-[A-zÀ-ú_@]+)",\
"(d'|D')([·A-zÀ-ú@_]+)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|'|,)+","(l'|L')([·A-zÀ-ú_]+)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|'|,)+", \
"(l'|l'|s'|s'|d'|d'|m'|m'|n'|n'|D'|D'|L'|L'|S'|S'|N'|N'|M'|M')([A-zÀ-ú_]+)",\
"""([A-zÀ-ú·]+)(\.|,|\)|\?|!|;|\:|\"|”)(\.|,|\)|\?|!|;|\:|\"|”)+""",\
"([A-zÀ-ú·]+)('l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)(\.|,|;|:|\?|,)+",\
"([A-zÀ-ú·]+)('l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)",\
"(\.|\"|;|:|!|\?|\-|\(|\)|”|“|')+([0-9A-zÀ-ú_]+)",\
"([0-9A-zÀ-ú·]+)(\.|\"|;|:|!|\?|\(|\)|”|“|'|,|%)",\
"(\.|\"|;|:|!|\?|\-|\(|\)|”|“|,)+([0-9]+)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|,)+",\
"(d'|D'|l'|L')([·A-zÀ-ú@_]+)('l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|,)", \
"([\.|\"|;|:|!|\?|\-|\(|\)|”|“|,]+)([\.|\"|;|:|!|\?|\-|\(|\)|”|“|,]+)"]) \
.setExceptions(ex_list_all)
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bsc","ca") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["M'encanta fer anar aixó."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
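Each infix pattern above is a regular expression whose capture groups are emitted as separate tokens. As an illustration (plain Python, outside Spark NLP), the pattern that detaches the elided article `d'`/`D'` from the following word behaves like this:

```python
import re

# One of the infix patterns above: detach the elided article "d'" / "D'"
# from the word that follows it; each capture group becomes its own token.
infix = re.compile(r"(d'|D')([·A-zÀ-ú@_]+)")

def split_contraction(token: str):
    m = infix.fullmatch(token)
    return list(m.groups()) if m else [token]

print(split_contraction("d'aigua"))  # -> ["d'", 'aigua']
print(split_contraction("casa"))     # -> ['casa']
```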
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_bsc|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Community|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|ca|
|Size:|300.3 MB|
|Case sensitive:|true|
## References
projecte-aina/catalan_general_crawling @ huggingface
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_256_finetuned_squad_seed_10
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-256-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_10_en_4.3.0_3.0_1674214731317.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_10_en_4.3.0_3.0_1674214731317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_10","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_10","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_256_finetuned_squad_seed_10|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|427.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-10
---
layout: model
title: Financial Executives Item Binary Classifier
author: John Snow Labs
name: finclf_executives_item
date: 2022-08-10
tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a binary classifier (True, False) for the `executives` item type of 10-K annual reports. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences instead of the whole text, so it is better to skip them unless you want to do binary classification at sentence level.
If you have big financial documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
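The paragraph-splitting option mentioned above can be sketched with the standard library: split on blank lines and check each piece against the 512-token budget (using a naive whitespace token count as a stand-in for the model's real subword tokenizer):

```python
import re

MAX_TOKENS = 512  # embedding limit mentioned above

def split_paragraphs(text: str):
    """Split a document into paragraphs on blank lines (multiline split)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def fits_budget(paragraph: str, limit: int = MAX_TOKENS) -> bool:
    # Naive whitespace count; a real check would use the model's tokenizer.
    return len(paragraph.split()) <= limit

doc = "Item 10. Executives.\n\nJohn Doe has served as CEO since 2019.\n\n\nRisk factors follow."
for p in split_paragraphs(doc):
    print(fits_budget(p), p)
```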
## Predicted Entities
`other`, `executives`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_executives_item_en_1.0.0_3.2_1660154397820.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_executives_item_en_1.0.0_3.2_1660154397820.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+------------+
|      result|
+------------+
|[executives]|
|     [other]|
|     [other]|
|[executives]|
+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finclf_executives_item|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.6 MB|
## References
Weak labelling on documents from Edgar database
## Benchmarking
```bash
label precision recall f1-score support
executives 0.96 0.98 0.97 46
other 0.98 0.96 0.97 45
accuracy - - 0.97 91
macro-avg 0.97 0.97 0.97 91
weighted-avg 0.97 0.97 0.97 91
```
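The `weighted-avg` row in the table above follows directly from the per-label rows as a support-weighted mean; a quick sketch:

```python
# Per-label F1 and support from the benchmarking table above.
scores = {"executives": (0.97, 46), "other": (0.97, 45)}

def weighted_avg_f1(per_label):
    """Support-weighted mean of per-label F1 scores."""
    total = sum(support for _, support in per_label.values())
    return sum(f1 * support for f1, support in per_label.values()) / total

print(round(weighted_avg_f1(scores), 2))  # -> 0.97
```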
---
layout: model
title: English BertForQuestionAnswering Cased model (from motiondew)
author: John Snow Labs
name: bert_qa_set_date_3_lr_3e_5_bs_32_ep_3
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-set_date_3-lr-3e-5-bs-32-ep-3` is an English model originally trained by `motiondew`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_3_lr_3e_5_bs_32_ep_3_en_4.0.0_3.0_1657188664961.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_3_lr_3e_5_bs_32_ep_3_en_4.0.0_3.0_1657188664961.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_3_lr_3e_5_bs_32_ep_3","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_3_lr_3e_5_bs_32_ep_3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_set_date_3_lr_3e_5_bs_32_ep_3|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/motiondew/bert-set_date_3-lr-3e-5-bs-32-ep-3
---
layout: model
title: Fast Neural Machine Translation Model from Albanian to English
author: John Snow Labs
name: opus_mt_sq_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, sq, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `sq`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_sq_en_xx_2.7.0_2.4_1609167620053.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_sq_en_xx_2.7.0_2.4_1609167620053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_sq_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_sq_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.sq.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_sq_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Detect Assertion Status (assertion_dl_biobert_scope_L10R10)
author: John Snow Labs
name: assertion_dl_biobert_scope_L10R10
date: 2022-03-24
tags: [licensed, clinical, en, assertion, biobert]
task: Assertion Status
language: en
nav_key: models
edition: Healthcare NLP 3.4.2
spark_version: 2.4
supported: true
annotator: AssertionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained using `biobert_pubmed_base_cased` BERT token embeddings. It considers 10 tokens on the left and 10 tokens on the right side of the clinical entities extracted by NER models and assigns their assertion status based on their context in this scope.
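The L10R10 scope can be pictured as a simple window over the token sequence: up to 10 tokens to the left and 10 to the right of the entity chunk. A toy illustration of the idea (not the model's internal implementation):

```python
def scope_window(tokens, ent_start, ent_end, left=10, right=10):
    """Tokens considered for an entity spanning tokens[ent_start:ent_end + 1]:
    up to `left` tokens before it and `right` tokens after it."""
    lo = max(0, ent_start - left)
    hi = min(len(tokens), ent_end + 1 + right)
    return tokens[lo:hi]

tokens = "He shows no stomach pain and he maintained on an epidural".split()
# Entity "stomach pain" spans token indices 3..4; smaller window for display.
print(scope_window(tokens, 3, 4, left=2, right=3))
```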
## Predicted Entities
`present`, `absent`, `possible`, `conditional`, `associated_with_someone_else`, `hypothetical`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_biobert_scope_L10R10_en_3.4.2_2.4_1648148217364.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_biobert_scope_L10R10_en_3.4.2_2.4_1648148217364.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
token = Tokenizer()\
.setInputCols(['sentence'])\
.setOutputCol('token')
embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical_biobert", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
clinical_assertion = AssertionDLModel.pretrained("assertion_dl_biobert_scope_L10R10","en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
nlpPipeline = Pipeline(stages=[document,
sentenceDetector,
token,
embeddings,
clinical_ner,
ner_converter,
clinical_assertion])
text = "Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer."
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical_biobert", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val clinical_assertion = AssertionDLModel.pretrained("assertion_dl_biobert_scope_L10R10","en", "clinical/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings"))
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter, clinical_assertion))
val data = Seq("Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.assert.biobert_l10210").predict("""Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","th") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["ฉันรัก Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","th")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("ฉันรัก Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("th.embed.w2v_cc_300d").predict("""ฉันรัก Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|th|
|Size:|1.2 GB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: OCR base v2 for handwritten text
author: John Snow Labs
name: ocr_base_handwritten_v2
date: 2023-01-17
tags: [en, licensed]
task: OCR Text Detection & Recognition
language: en
nav_key: models
edition: Visual NLP 4.2.4
spark_version: 3.2.1
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
OCR base handwritten model v2 recognises handwritten text and is based on the TrOCR architecture, pretrained on handwritten datasets. The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition (OCR). The abstract from the paper is the following: Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/ocr/RECOGNIZE_HANDWRITTEN/){:.button.button-orange.button-orange-trans.co.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrImageToTextHandwritten_V2.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/ocr_base_handwritten_v2_en_4.2.2_3.0_1670602309000.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/ocr_base_handwritten_v2_en_4.2.2_3.0_1670602309000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Example
{%- capture input_image -%}

{%- endcapture -%}
{%- capture output_image -%}

{%- endcapture -%}
{% include templates/input_output_image.md
input_image=input_image
output_image=output_image
%}
## Output text
```bash
This is an example of handwritten
beerxt
Let's # check the performance !
I hope it will be awesome
```
## Model Information
{:.table-model}
|---|---|
|Model Name:|ocr_base_handwritten_v2|
|Type:|ocr|
|Compatibility:|Visual NLP 4.2.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
---
layout: model
title: English DistilBertForQuestionAnswering model (from ajaypyatha)
author: John Snow Labs
name: distilbert_qa_sdsqna
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `sdsqna` is an English model originally trained by `ajaypyatha`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sdsqna_en_4.0.0_3.0_1654728628088.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sdsqna_en_4.0.0_3.0_1654728628088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sdsqna","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sdsqna","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.by_ajaypyatha").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_sdsqna|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ajaypyatha/sdsqna
---
layout: model
title: Pipeline to Detect Living Species(biobert_embeddings_biomedical)
author: John Snow Labs
name: ner_living_species_bert_pipeline
date: 2023-03-13
tags: [pt, ner, clinical, licensed, biobert]
task: Named Entity Recognition
language: pt
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_living_species_bert](https://nlp.johnsnowlabs.com/2022/06/22/ner_living_species_bert_pt_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_pipeline_pt_4.3.0_3.2_1678729438675.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_pipeline_pt_4.3.0_3.2_1678729438675.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_living_species_bert_pipeline", "pt", "clinical/models")
text = '''Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito..'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_living_species_bert_pipeline", "pt", "clinical/models")
val text = "Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito.."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:--------------------|--------:|------:|:------------|-------------:|
| 0 | rapariga | 4 | 11 | HUMAN | 0.9849 |
| 1 | pessoal | 41 | 47 | HUMAN | 0.9994 |
| 2 | paciente | 182 | 189 | HUMAN | 1 |
| 3 | gato | 368 | 371 | SPECIES | 0.9912 |
| 4 | veterinário | 413 | 423 | HUMAN | 0.9909 |
| 5 | Trichophyton rubrum | 632 | 650 | SPECIES | 0.9778 |
```
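The `begin` and `end` columns above are zero-based character offsets into the input text, with `end` inclusive; a quick sanity check in plain Python (the slicing below is an illustration, not part of the pipeline):

```python
# The table lists `rapariga` at begin=4, end=11 and `pessoal` at begin=41, end=47.
text = ("Uma rapariga de 16 anos com um historial pessoal de asma "
        "apresentou ao departamento de dermatologia")

for chunk in ["rapariga", "pessoal"]:
    begin = text.find(chunk)
    end = begin + len(chunk) - 1           # `end` is inclusive
    assert text[begin:end + 1] == chunk    # slicing needs end + 1 (exclusive)
    print(chunk, begin, end)
```

This is why, for example, `rapariga` spans 8 characters but its `end` is 11, not 12.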
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_living_species_bert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|pt|
|Size:|684.8 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Legal Agricultural Structures And Production Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_agricultural_structures_and_production_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, agricultural_structures_and_production, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the legclf_agricultural_structures_and_production_bert model, a BERT Sentence Embeddings Document Classifier, predicts whether the document belongs to the class Agricultural_Structures_and_Production or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`Agricultural_Structures_and_Production`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_agricultural_structures_and_production_bert_en_1.0.0_3.0_1678111593399.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_agricultural_structures_and_production_bert_en_1.0.0_3.0_1678111593399.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+----------------------------------------+
|result                                  |
+----------------------------------------+
|[Agricultural_Structures_and_Production]|
|[Other]                                 |
|[Other]                                 |
|[Agricultural_Structures_and_Production]|
+----------------------------------------+
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_agricultural_structures_and_production_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.3 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Agricultural_Structures_and_Production 0.83 0.91 0.87 44
Other 0.90 0.81 0.85 43
accuracy - - 0.86 87
macro-avg 0.87 0.86 0.86 87
weighted-avg 0.87 0.86 0.86 87
```
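The macro and weighted averages in the table follow directly from the per-class rows; a plain-Python check of the arithmetic (small rounding differences against the displayed two-decimal figures are possible):

```python
# Per-class F1 and support, copied from the benchmarking table above.
f1 = {"Agricultural_Structures_and_Production": 0.87, "Other": 0.85}
support = {"Agricultural_Structures_and_Production": 44, "Other": 43}

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: mean weighted by each class's support.
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] / total for c in f1)

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.86 0.86
```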
---
layout: model
title: XLM-RoBERTa Base (xlm_roberta_base)
author: John Snow Labs
name: xlm_roberta_base
date: 2021-05-25
tags: [xx, multilingual, embeddings, xlm_roberta, open_source]
task: Embeddings
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
[XLM-RoBERTa](https://ai.facebook.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/) is a scaled cross-lingual sentence encoder. It is trained on 2.5 TB of data filtered from Common Crawl, covering 100 languages. XLM-R achieves state-of-the-art results on multiple cross-lingual benchmarks.
The XLM-RoBERTa model was proposed in [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov.
It is based on Facebook's RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_xx_3.1.0_2.4_1621961851929.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_xx_3.1.0_2.4_1621961851929.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
```
```scala
val embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
```
{:.nlu-block}
```python
import nlu
nlu.load("xx.embed.xlm").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_base|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[embeddings]|
|Language:|xx|
|Case sensitive:|true|
## Data Source
[https://huggingface.co/xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
---
layout: model
title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011)
author: John Snow Labs
name: distilbert_token_classifier_autotrain_name_all_904029569
date: 2023-03-14
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-name_all-904029569` is an English model originally trained by `ismail-lucifer011`.
## Predicted Entities
`Name`, `OOV`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_all_904029569_en_4.3.1_3.0_1678783428948.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_all_904029569_en_4.3.1_3.0_1678783428948.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_all_904029569","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_all_904029569","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_autotrain_name_all_904029569|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ismail-lucifer011/autotrain-name_all-904029569
---
layout: model
title: SDOH Tobacco Usage For Classification
author: John Snow Labs
name: genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli
date: 2023-01-14
tags: [en, licensed, generic_classifier, sdoh, tobacco, clinical]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
recommended: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Generic Classifier model detects tobacco use in clinical notes and was trained using the GenericClassifierApproach annotator. `Present`: the patient is a current consumer of tobacco. `Past`: the patient consumed tobacco in the past and has quit. `Never`: the patient has never consumed tobacco. `None`: there is no related text.
## Predicted Entities
`Present`, `Past`, `Never`, `None`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_TOBACCO/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_CLASSIFICATION.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli_en_4.2.4_3.0_1673697468673.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli_en_4.2.4_3.0_1673697468673.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
features_asm = FeaturesAssembler()\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("features")
generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli", 'en', 'clinical/models')\
.setInputCols(["features"])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
sentence_embeddings,
features_asm,
generic_classifier
])
text_list = ["Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes",
"The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use. He also reports 'heavy alcohol use', quit 15 months ago.",
"The patient denies any history of smoking or alcohol abuse. She lives with her one daughter.",
"She was previously employed as a hairdresser, though says she hasnt worked in 4 years. Not reported by patient, but there is apparently a history of alochol abuse."]
df = spark.createDataFrame(text_list, StringType()).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("text", "class.result").show(truncate=100)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val features_asm = new FeaturesAssembler()
.setInputCols("sentence_embeddings")
.setOutputCol("features")
val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli", "en", "clinical/models")
.setInputCols("features")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_embeddings,
features_asm,
generic_classifier))
val data = Seq("Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.generic.sdoh_tobacco_sbiobert_cased").predict("""The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use. He also reports 'heavy alcohol use', quit 15 months ago.""")
```
## Results
```bash
+----------------------------------------------------------------------------------------------------+---------+
| text| result|
+----------------------------------------------------------------------------------------------------+---------+
|Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 2...|[Present]|
|The patient quit smoking approximately two years ago with an approximately a 40 pack year history...| [Past]|
| The patient denies any history of smoking or alcohol abuse. She lives with her one daughter.| [Never]|
|She was previously employed as a hairdresser, though says she hasnt worked in 4 years. Not report...| [None]|
+----------------------------------------------------------------------------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli|
|Compatibility:|Healthcare NLP 4.2.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[features]|
|Output Labels:|[prediction]|
|Language:|en|
|Size:|3.4 MB|
## Benchmarking
```bash
label precision recall f1-score support
Never 0.89 0.90 0.90 487
None 0.86 0.78 0.82 269
Past 0.87 0.79 0.83 415
Present 0.63 0.82 0.71 203
accuracy - - 0.83 1374
macro-avg 0.81 0.82 0.81 1374
weighted-avg 0.84 0.83 0.83 1374
```
---
layout: model
title: Sentiment Analysis pipeline for English
author: John Snow Labs
name: analyze_sentiment
date: 2021-03-24
tags: [open_source, english, analyze_sentiment, pipeline, en]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: en
nav_key: models
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The analyze_sentiment pipeline is a pretrained pipeline that performs basic text processing steps, recognizes entities, and scores sentiment.
It covers most of the common text processing tasks on your dataframe.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/analyze_sentiment_en_3.0.0_3.0_1616544471011.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/analyze_sentiment_en_3.0.0_3.0_1616544471011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('analyze_sentiment', lang = 'en')
result = pipeline.fullAnnotate("""Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!""")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("analyze_sentiment", lang = "en")
val result = pipeline.fullAnnotate("""Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!""")
```
{:.nlu-block}
```python
import nlu
text = ["""Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!"""]
result_df = nlu.load('en.classify').predict(text)
result_df
```
## Results
```bash
| | text | sentiment |
|---:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------|
| 0 | Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now! | positive |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|analyze_sentiment|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
---
layout: model
title: Lemmatizer (Serbian, SpacyLookup)
author: John Snow Labs
name: lemma_spacylookup
date: 2022-03-03
tags: [open_source, lemmatizer, sr]
task: Lemmatization
language: sr
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Serbian Lemmatizer is a scalable, production-ready version of the Rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_sr_3.4.1_3.0_1646316491633.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_sr_3.4.1_3.0_1646316491633.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","sr") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer])
example = spark.createDataFrame([["Ниси бољи од мене"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","sr")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer))
val data = Seq("Ниси бољи од мене").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("sr.lemma").predict("""Ниси бољи од мене""")
```
## Results
```bash
+---------------------+
|result |
+---------------------+
|[Ниси, добар, од, ја]|
+---------------------+
```
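A lookup lemmatizer like this one is essentially a token-to-lemma table; the toy table below is hand-written for the four tokens in the example above and is not the model's actual dictionary:

```python
# Minimal sketch of a lookup lemmatizer: tokens found in the table are replaced
# by their lemma; out-of-vocabulary tokens pass through unchanged.
lookup = {"бољи": "добар", "мене": "ја"}  # toy entries, not the real lookup data

def lemmatize(tokens):
    return [lookup.get(token, token) for token in tokens]

print(lemmatize(["Ниси", "бољи", "од", "мене"]))  # ['Ниси', 'добар', 'од', 'ја']
```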
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma_spacylookup|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|sr|
|Size:|3.2 MB|
---
layout: model
title: Czech asr_wav2vec2_large_xlsr_czech TFWav2Vec2ForCTC from arampacha
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_czech
date: 2022-09-25
tags: [wav2vec2, cs, audio, open_source, asr]
task: Automatic Speech Recognition
language: cs
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_czech` is a Czech model originally trained by arampacha.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_czech_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_czech_cs_4.2.0_3.0_1664120388474.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_czech_cs_4.2.0_3.0_1664120388474.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xlsr_czech", "cs")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xlsr_czech", "cs")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xlsr_czech|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|cs|
|Size:|1.2 GB|
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from alex-apostolo)
author: John Snow Labs
name: roberta_qa_base_filtered_cuad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-filtered-cuad` is an English model originally trained by `alex-apostolo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_filtered_cuad_en_4.3.0_3.0_1674216293189.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_filtered_cuad_en_4.3.0_3.0_1674216293189.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_filtered_cuad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_filtered_cuad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
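Under the hood, an extractive QA head produces a start logit and an end logit per token; the predicted answer is the span maximizing start + end with start <= end. The sketch below is illustrative only (hypothetical logits, not Spark NLP's actual decoding code).

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return (start, end) token indices maximizing start + end logits."""
    best = (0, 0)
    best_score = float("-inf")
    for s, sl in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = sl + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]  # hypothetical logits
end   = [0.0, 0.1, 0.2, 4.8, 0.1, 0.0, 0.0, 0.0, 0.5, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```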
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_filtered_cuad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|454.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/alex-apostolo/roberta-base-filtered-cuad
---
layout: model
title: Stopwords Remover for Czech language (358 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, cs, open_source]
task: Stop Words Removal
language: cs
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_cs_3.4.1_3.0_1646673248718.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_cs_3.4.1_3.0_1646673248718.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","cs") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Nejste lepší než já"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","cs")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("Nejste lepší než já").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("cs.stopwords").predict("""Nejste lepší než já""")
```
## Results
```bash
+---------------+
|result |
+---------------+
|[Nejste, lepší]|
+---------------+
```
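What `StopWordsCleaner` does is essentially a set-membership filter over tokens. A minimal sketch with a two-word illustrative subset (not the full 358-entry ISO list) reproduces the result above:

```python
def clean_tokens(tokens, stopwords, case_sensitive=False):
    """Drop tokens present in the stopword list; lowercase-match by default."""
    if case_sensitive:
        return [t for t in tokens if t not in stopwords]
    lowered = {w.lower() for w in stopwords}
    return [t for t in tokens if t.lower() not in lowered]

czech_stopwords = {"než", "já"}  # tiny illustrative subset of the ISO list
print(clean_tokens(["Nejste", "lepší", "než", "já"], czech_stopwords))
# -> ['Nejste', 'lepší']
```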
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|cs|
|Size:|2.4 KB|
---
layout: model
title: English BertForQuestionAnswering Cased model (from irenelizihui)
author: John Snow Labs
name: bert_qa_irenelizihui_finetuned_squad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `irenelizihui`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_irenelizihui_finetuned_squad_en_4.0.0_3.0_1657186550664.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_irenelizihui_finetuned_squad_en_4.0.0_3.0_1657186550664.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_irenelizihui_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_irenelizihui_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_irenelizihui_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/irenelizihui/bert-finetuned-squad
---
layout: model
title: English RoBERTa Embeddings (Sampling strategy 'full select')
author: John Snow Labs
name: roberta_embeddings_distilroberta_base_climate_f
date: 2022-04-14
tags: [roberta, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilroberta-base-climate-f` is an English model originally trained by `climatebert`.
Sampling strategy f: As described in the authors' paper [here](https://arxiv.org/pdf/2110.12010.pdf), f is the "full select" strategy, meaning all sentences from all corpora were selected.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_climate_f_en_3.4.2_3.0_1649946254298.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_climate_f_en_3.4.2_3.0_1649946254298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base_climate_f","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base_climate_f","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.distilroberta_base_climate_f").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_distilroberta_base_climate_f|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|310.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/climatebert/distilroberta-base-climate-f
- https://arxiv.org/abs/2110.12010
---
layout: model
title: Legal Multilabel Classification on Terms of Service (UNFAIR-ToS)
author: John Snow Labs
name: legmulticlf_unfair_tos
date: 2023-03-08
tags: [en, legal, licensed, classification, unfair, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
recommended: true
engine: tensorflow
annotator: MultiClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a multilabel text classification model that tags sentences from Terms of Service with 8 types of potentially unfair contractual terms (terms that may violate user rights under European consumer law), plus an `Other` class.
## Predicted Entities
`Arbitration`, `Choice_of_Law`, `Content_Removal`, `Contract_by_Using`, `Jurisdiction`, `Limitation_of_Liability`, `Unilateral_Change`, `Unilateral_Termination`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legmulticlf_unfair_tos_en_1.0.0_3.0_1678283272065.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legmulticlf_unfair_tos_en_1.0.0_3.0_1678283272065.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")\
.setInputCols(["document", "token"])\
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)
embeddingsSentence = nlp.SentenceEmbeddings()\
.setInputCols(["document", "embeddings"])\
.setOutputCol("sentence_embeddings")\
.setPoolingStrategy("AVERAGE")
docClassifier = nlp.MultiClassifierDLModel().pretrained("legmulticlf_unfair_tos", "en", "legal/models")\
.setInputCols("sentence_embeddings") \
.setOutputCol("class")
pipeline = nlp.Pipeline(
stages=[
document_assembler,
tokenizer,
embeddings,
embeddingsSentence,
docClassifier
]
)
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(empty_data)
light_model = nlp.LightPipeline(model)
result = light_model.annotate("""we may alter, suspend or discontinue any aspect of the service at any time, including the availability of any service feature, database or content.""")
```
## Results
```bash
['Unilateral_Change', 'Unilateral_Termination']
```
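The `MultiClassifierDLModel` stage is multilabel: it scores every class independently with a sigmoid and emits each class whose score clears a threshold (0.5 by default), which is how one sentence can receive both `Unilateral_Change` and `Unilateral_Termination`. A minimal sketch with hypothetical scores:

```python
def multilabel_predict(scores, threshold=0.5):
    """Return every label whose sigmoid score clears the threshold."""
    return sorted(label for label, p in scores.items() if p >= threshold)

scores = {  # hypothetical sigmoid outputs for the example sentence above
    "Unilateral_Change": 0.91,
    "Unilateral_Termination": 0.78,
    "Limitation_of_Liability": 0.12,
    "Other": 0.31,
}
print(multilabel_predict(scores))
# -> ['Unilateral_Change', 'Unilateral_Termination']
```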
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legmulticlf_unfair_tos|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|13.9 MB|
## References
Train dataset available [here](https://github.com/coastalcph/lex-glue)
## Benchmarking
```bash
label precision recall f1-score support
Arbitration 1.00 0.82 0.90 11
Choice_of_Law 0.93 0.93 0.93 14
Content_Removal 0.80 0.57 0.67 21
Contract_by_Using 0.93 0.82 0.87 17
Jurisdiction 1.00 1.00 1.00 16
Limitation_of_Liability 0.81 0.80 0.81 60
Other 0.78 0.71 0.75 66
Unilateral_Change 0.94 0.84 0.89 38
Unilateral_Termination 0.78 0.81 0.79 36
micro-avg 0.85 0.79 0.82 279
macro-avg 0.89 0.81 0.85 279
weighted-avg 0.85 0.79 0.82 279
samples-avg 0.78 0.80 0.78 279
```
---
layout: model
title: ICD10CM Entity Resolver
author: John Snow Labs
name: chunkresolve_icd10cm_clinical
class: ChunkEntityResolverModel
language: en
nav_key: models
repository: clinical/models
date: 2020-04-21
task: Entity Resolution
edition: Healthcare NLP 2.4.2
spark_version: 2.4
tags: [clinical,licensed,entity_resolution,en]
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Entity Resolution model based on KNN using Word Embeddings + Word Mover's Distance.
## Predicted Entities
ICD10-CM codes and their normalized definitions, resolved with `embeddings_clinical`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/enterprise/healthcare/EntityResolution_ICD10_RxNorm_Detailed.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_clinical_en_2.4.5_2.4_1587491222166.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_clinical_en_2.4.5_2.4_1587491222166.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPython.html %}
```python
...
icd10cm_resolution = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_clinical", "en", "clinical/models") \
.setInputCols(["ner_token", "chunk_embeddings"]) \
.setOutputCol("icd10cm_code") \
.setDistanceFunction("COSINE") \
.setNeighbours(5)
pipeline_icd10cm = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, chunk_tokenizer, icd10cm_resolution])
data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG."""]]).toDF("text")
pipeline_model = pipeline_icd10cm.fit(data)
result = pipeline_model.transform(data)
```
```scala
...
val icd10cm_resolution = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_clinical", "en", "clinical/models")
.setInputCols("ner_token", "chunk_embeddings")
.setOutputCol("icd10cm_code")
.setDistanceFunction("COSINE")
.setNeighbours(5)
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, chunk_tokenizer, icd10cm_resolution))
val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
```bash
| | chunk | entity | resolved_text | code | cms |
|---|-----------------------------|-----------|----------------------------------------------------|--------|---------------------------------------------------|
| 0 | T2DM), | PROBLEM | Type 2 diabetes mellitus with diabetic nephrop... | E1121 | Type 2 diabetes mellitus with diabetic nephrop... |
| 1 | T2DM | PROBLEM | Type 2 diabetes mellitus with diabetic nephrop... | E1121 | Type 2 diabetes mellitus with diabetic nephrop... |
| 2 | polydipsia | PROBLEM | Polydipsia | R631 | Polydipsia:::Anhedonia:::Galactorrhea |
| 3 | interference from turbidity | PROBLEM | Non-working side interference | M2656 | Non-working side interference:::Hemoglobinuria... |
| 4 | polyuria | PROBLEM | Other polyuria | R358 | Other polyuria:::Polydipsia:::Generalized edem... |
| 5 | lipemia | PROBLEM | Glycosuria | R81 | Glycosuria:::Pure hyperglyceridemia:::Hyperchy... |
| 6 | starvation ketosis | PROBLEM | Propionic acidemia | E71121 | Propionic acidemia:::Bartter's syndrome:::Hypo... |
| 7 | HTG | PROBLEM | Pure hyperglyceridemia | E781 | Pure hyperglyceridemia:::Familial hypercholest... |
```
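Conceptually, the resolver embeds each NER chunk and returns the ICD-10-CM code whose description embedding is nearest under the configured distance (`COSINE` above, with `setNeighbours(5)` for k = 5). A toy k = 1 sketch with made-up 3-dimensional vectors and two hypothetical codes:

```python
import math

def cosine_dist(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def resolve(chunk_vec, code_index):
    """Nearest code by cosine distance (k=1 for brevity)."""
    return min(code_index, key=lambda code: cosine_dist(chunk_vec, code_index[code]))

code_index = {  # hypothetical embeddings of two code descriptions
    "E1121": [0.9, 0.1, 0.0],   # Type 2 diabetes mellitus with ...
    "R631":  [0.1, 0.8, 0.2],   # Polydipsia
}
print(resolve([0.85, 0.15, 0.05], code_index))  # -> E1121
```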
{:.model-param}
## Model Information
{:.table-model}
|----------------|-------------------------------|
| Name: | chunkresolve_icd10cm_clinical |
| Type: | ChunkEntityResolverModel |
| Compatibility: | Spark NLP 2.4.2+ |
| License: | Licensed |
| Edition: | Official |
|Input labels: | token, chunk_embeddings |
|Output labels: | entity |
| Language: | en |
| Case sensitive: | True |
| Dependencies: | embeddings_clinical |
{:.h2_title}
## Data Source
Trained on the ICD-10 Clinical Modification dataset, with tens of variations per code.
https://www.icd10data.com/ICD10CM/Codes/
---
layout: model
title: Detect Assertion Status (assertion_dl_healthcare) - supports confidence scores
author: John Snow Labs
name: assertion_dl_healthcare
date: 2021-01-26
task: Assertion Status
language: en
nav_key: models
edition: Healthcare NLP 2.7.2
spark_version: 2.4
tags: [assertion, en, licensed, clinical, healthcare]
supported: true
annotator: AssertionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Assign assertion status to clinical entities extracted by NER based on their context in the text.
## Predicted Entities
`absent`, `present`, `conditional`, `associated_with_someone_else`, `hypothetical`, `possible`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION/){:.button.button-orange}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_healthcare_en_2.7.2_2.4_1611646187271.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_healthcare_en_2.7.2_2.4_1611646187271.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_healthcare", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
clinical_assertion = AssertionDLModel.pretrained("assertion_dl_healthcare", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
clinical_assertion
])
data = spark.createDataFrame([["""Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer."""]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_healthcare", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val clinical_assertion = AssertionDLModel.pretrained("assertion_dl_healthcare","en", "clinical/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings"))
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
clinical_assertion))
val data = Seq("Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.assert.healthcare").predict("""Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.""")
```
## Results
```bash
+---------------+---------+----------------------------+
|chunk |ner_label|assertion |
+---------------+---------+----------------------------+
|severe fever |PROBLEM |present |
|sore throat |PROBLEM |present |
|stomach pain |PROBLEM |absent |
|an epidural |TREATMENT|present |
|PCA |TREATMENT|present |
|pain control |TREATMENT|present |
|short of breath|PROBLEM |conditional |
|CT |TEST |present |
|lung tumor |PROBLEM |present |
|Alzheimer |PROBLEM |associated_with_someone_else|
+---------------+---------+----------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|assertion_dl_healthcare|
|Compatibility:|Spark NLP 2.7.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, chunk, embeddings]|
|Output Labels:|[assertion]|
|Language:|en|
## Data Source
Trained on i2b2 assertion data
## Benchmarking
```bash
label tp fp fn prec rec f1
absent 726 86 98 0.894089 0.881068 0.887531
present 2544 232 119 0.916427 0.955314 0.935466
conditional 18 13 37 0.580645 0.327273 0.418605
associated_with_someone_else 40 5 9 0.888889 0.816327 0.851064
hypothetical 132 13 26 0.910345 0.835443 0.871287
possible 96 45 105 0.680851 0.477612 0.561404
Macro-average 3556 394 394 0.811874 0.715506 0.76065
Micro-average 3556 394 394 0.900253 0.900253 0.900253
```
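As a sanity check, the micro-averaged row follows directly from the pooled counts in the table: micro precision is tp / (tp + fp) and micro recall is tp / (tp + fn); with tp = 3556 and fp = fn = 394, both come out to 0.900253, matching the reported figures.

```python
# Pooled counts from the micro-average row of the benchmark above.
tp, fp, fn = 3556, 394, 394

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 6), round(recall, 6), round(f1, 6))
# -> 0.900253 0.900253 0.900253
```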
---
layout: model
title: Legal Compensation Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_compensation_agreement_bert
date: 2023-01-29
tags: [en, legal, classification, compensation, agreement, licensed, bert, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_compensation_agreement_bert` model is a Bert Sentence Embeddings document classifier that predicts whether a document belongs to the `compensation-agreement` class (binary classification).
Compared with the Longformer variant, this model is lighter and faster at inference.
## Predicted Entities
`compensation-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_compensation_agreement_bert_en_1.0.0_3.0_1674990338214.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_compensation_agreement_bert_en_1.0.0_3.0_1674990338214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
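{% include programmingLanguageSelectScalaPythonNLU.html %}

A minimal pipeline sketch follows the pattern of the other Bert-based legal document classifiers: sentence-level BERT embeddings feed the classifier stage. This is illustrative only; it assumes a licensed Legal NLP installation, and the embeddings model name `sent_bert_base_cased` is an assumption, not confirmed for this classifier, so check the model's training details for the exact embeddings used.

```python
# Sketch only: requires a licensed Legal NLP install; the embeddings model
# name below is a hypothetical placeholder, not confirmed for this classifier.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_compensation_agreement_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```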
## Results
```bash
+------------------------+
|result                  |
+------------------------+
|[compensation-agreement]|
|[other]                 |
|[other]                 |
|[compensation-agreement]|
+------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_compensation_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house + SEC documents
## Benchmarking
```bash
label precision recall f1-score support
compensation-agreement 0.96 0.84 0.90 31
other 0.94 0.99 0.96 82
accuracy - - 0.95 113
macro-avg 0.95 0.91 0.93 113
weighted-avg 0.95 0.95 0.95 113
```
---
layout: model
title: Igbo Named Entity Recognition (from mbeukman)
author: John Snow Labs
name: xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_igbo
date: 2022-05-17
tags: [xlm_roberta, ner, token_classification, ig, open_source]
task: Named Entity Recognition
language: ig
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-swahili-finetuned-ner-igbo` is an Igbo model originally trained by `mbeukman`.
## Predicted Entities
`PER`, `ORG`, `LOC`, `DATE`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_igbo_ig_3.4.2_3.0_1652809139166.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_igbo_ig_3.4.2_3.0_1652809139166.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_igbo","ig") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Ahụrụ m n'anya na-atọ m ụtọ"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_igbo","ig")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Ahụrụ m n'anya na-atọ m ụtọ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
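The token classifier emits one BIO tag per token; turning those tags into entity chunks (what `NerConverter` does in other pipelines) is a simple merge of consecutive B-/I- tags of the same type. A sketch with a hypothetical tagged sentence:

```python
def bio_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (text, label) entity chunks."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" tag or inconsistent I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Hypothetical example, not actual model output:
tokens = ["Chinedu", "si", "Lagos", "bia"]
tags   = ["B-PER", "O", "B-LOC", "O"]
print(bio_to_chunks(tokens, tags))  # -> [('Chinedu', 'PER'), ('Lagos', 'LOC')]
```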
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_igbo|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|ig|
|Size:|1.0 GB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-swahili-finetuned-ner-igbo
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://www.apache.org/licenses/LICENSE-2.0
- https://github.com/Michael-Beukman/
---
layout: model
title: Translate Indic languages to English Pipeline
author: John Snow Labs
name: translate_inc_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, inc, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `inc`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_inc_en_xx_2.7.0_2.4_1609688074696.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_inc_en_xx_2.7.0_2.4_1609688074696.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_inc_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_inc_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.inc.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_inc_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Kabyle asr_Kabyle_xlsr TFWav2Vec2ForCTC from Akashpb13
author: John Snow Labs
name: pipeline_asr_Kabyle_xlsr
date: 2022-09-24
tags: [wav2vec2, kab, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: kab
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Kabyle_xlsr` is a Kabyle model originally trained by Akashpb13.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_Kabyle_xlsr_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_Kabyle_xlsr_kab_4.2.0_3.0_1664018945760.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_Kabyle_xlsr_kab_4.2.0_3.0_1664018945760.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_Kabyle_xlsr', lang = 'kab')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_Kabyle_xlsr", lang = "kab")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_Kabyle_xlsr|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|kab|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Finnish T5ForConditionalGeneration Mini Cased model (from Finnish-NLP)
author: John Snow Labs
name: t5_mini_nl8
date: 2023-01-31
tags: [fi, open_source, t5, tensorflow]
task: Text Generation
language: fi
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-mini-nl8-finnish` is a Finnish model originally trained by `Finnish-NLP`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_mini_nl8_fi_4.3.0_3.0_1675124948833.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_mini_nl8_fi_4.3.0_3.0_1675124948833.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_mini_nl8","fi") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_mini_nl8","fi")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_mini_nl8|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|fi|
|Size:|315.9 MB|
## References
- https://huggingface.co/Finnish-NLP/t5-mini-nl8-finnish
- https://arxiv.org/abs/1910.10683
- https://github.com/google-research/text-to-text-transfer-transformer
- https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511
- https://arxiv.org/abs/2002.05202
- https://arxiv.org/abs/2109.10686
- http://urn.fi/urn:nbn:fi:lb-2017070501
- http://urn.fi/urn:nbn:fi:lb-2021050401
- http://urn.fi/urn:nbn:fi:lb-2018121001
- http://urn.fi/urn:nbn:fi:lb-2020021803
- https://sites.research.google/trc/about/
- https://github.com/google-research/t5x
- https://github.com/spyysalo/yle-corpus
- https://github.com/aajanki/eduskunta-vkk
- https://sites.research.google/trc/
- https://www.linkedin.com/in/aapotanskanen/
- https://www.linkedin.com/in/rasmustoivanen/
---
layout: model
title: Sentence Entity Resolver for RxNorm (sbert_jsl_medium_rxnorm_uncased embeddings)
author: John Snow Labs
name: sbertresolve_jsl_rxnorm_augmented_med
date: 2021-12-28
tags: [clinical, entity_resolution, en, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.4
spark_version: 2.4
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes using `sbert_jsl_medium_rxnorm_uncased` Sentence BERT embeddings. It is trained on an augmented version of the dataset used in previous RxNorm resolver models. Additionally, this model returns the concept classes of the drugs in the all_k_aux_labels column.
## Predicted Entities
`RxNorm Codes`, `Concept Classes`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_jsl_rxnorm_augmented_med_en_3.3.4_2.4_1640686630389.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_jsl_rxnorm_augmented_med_en_3.3.4_2.4_1640686630389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
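Following the pattern of sibling resolver model cards, the resolver is fed sentence embeddings produced by the `sbert_jsl_medium_rxnorm_uncased` model named in the title. A minimal sketch (assuming a Healthcare NLP session is already started and available as `spark`; the example input string is illustrative):

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("ner_chunk")

# Embed each drug mention with the same sentence-BERT model the resolver was trained with
sbert_embedder = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_rxnorm_uncased", "en", "clinical/models") \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("sentence_embeddings") \
    .setCaseSensitive(False)

# Resolve the embedded mention to its closest RxNorm code
rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_jsl_rxnorm_augmented_med", "en", "clinical/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("rxnorm_code")

pipeline = Pipeline(stages=[documentAssembler, sbert_embedder, rxnorm_resolver])

data = spark.createDataFrame([["coumadin 5 mg"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```

In a full clinical pipeline, the `ner_chunk` column would typically come from an NER + chunk-to-document stage rather than directly from the `DocumentAssembler`.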
## Results
```bash
| | RxNormCode | Resolution | all_k_results | all_k_distances | all_k_cosine_distances | all_k_resolutions | all_k_aux_labels |
|---:|-------------:|:-------------------------------------------|:----------------------------------|:----------------------------------|:----------------------------------|:----------------------------------------------------------------|:----------------------------------|
| 0 | 855333 | warfarin sodium 5 MG [Coumadin] | 855333:::855334:::1110792:::11... | 0.0000:::6.0548:::6.1667:::6.1... | 0.0000:::0.0515:::0.0536:::0.0... | warfarin sodium 5 MG [Coumadin]:::warfarin sodium 5 MG Oral ... | Branded Drug Comp:::Branded Dr... |
| 1 | 1537020 | aspirin Effervescent Oral Tablet | 1537020:::1191:::202547:::1001... | 0.0000:::0.0000:::8.8123:::9.3... | 0.0000:::0.0000:::0.1145:::0.1... | aspirin Effervescent Oral Tablet:::aspirin:::Empirin:::Ecpir... | Clinical Drug Form:::Ingredien... |
| 2 | 105029 | gabapentin 300 MG Oral Capsule [Neurontin] | 105029:::1718929:::1718930:::3... | 5.5969:::8.7502:::8.7502:::8.7... | 0.0452:::0.1092:::0.1092:::0.1... | gabapentin 300 MG Oral Capsule [Neurontin]:::olanzapine 300 ... | Branded Drug:::Clinical Drug C... |
| 3 | 261242 | rosiglitazone 4 MG Oral Tablet [Avandia] | 261242:::2123140:::1792373:::8... | 0.0000:::7.1217:::7.7113:::8.4... | 0.0000:::0.0728:::0.0843:::0.1... | rosiglitazone 4 MG Oral Tablet [Avandia]:::erdafitinib 4 MG ... | Branded Drug:::Branded Drug Co... |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbertresolve_jsl_rxnorm_augmented_med|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[rxnorm_code]|
|Language:|en|
|Size:|650.7 MB|
|Case sensitive:|false|
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_8
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-16-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_8_en_4.0.0_3.0_1657184626320.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_8_en_4.0.0_3.0_1657184626320.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_8","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_8","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_8|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-16-finetuned-squad-seed-8
---
layout: model
title: Catalan RobertaForTokenClassification Cased model (from softcatala)
author: John Snow Labs
name: roberta_token_classifier_fullstop_catalan_punctuation_prediction
date: 2023-03-01
tags: [ca, open_source, roberta, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: ca
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fullstop-catalan-punctuation-prediction` is a Catalan model originally trained by `softcatala`.
## Predicted Entities
`.`, `?`, `-`, `:`, `,`, `0`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_fullstop_catalan_punctuation_prediction_ca_4.3.0_3.0_1677703587592.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_fullstop_catalan_punctuation_prediction_ca_4.3.0_3.0_1677703587592.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = RobertaForTokenClassification.pretrained("roberta_token_classifier_fullstop_catalan_punctuation_prediction","ca") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = RobertaForTokenClassification.pretrained("roberta_token_classifier_fullstop_catalan_punctuation_prediction","ca")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_token_classifier_fullstop_catalan_punctuation_prediction|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|ca|
|Size:|457.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/softcatala/fullstop-catalan-punctuation-prediction
- https://github.com/oliverguhr/fullstop-deep-punctuation-prediction
---
layout: model
title: Bemba (Zambia) asr_wav2vec2_large_xls_r_1b_bemba_fds TFWav2Vec2ForCTC from csikasote
author: John Snow Labs
name: asr_wav2vec2_large_xls_r_1b_bemba_fds
date: 2022-09-24
tags: [wav2vec2, bem, audio, open_source, asr]
task: Automatic Speech Recognition
language: bem
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_1b_bemba_fds` is a Bemba (Zambia) model originally trained by csikasote.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_1b_bemba_fds_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_1b_bemba_fds_bem_4.2.0_3.0_1664043378039.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_1b_bemba_fds_bem_4.2.0_3.0_1664043378039.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xls_r_1b_bemba_fds", "bem")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xls_r_1b_bemba_fds", "bem")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xls_r_1b_bemba_fds|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|bem|
|Size:|3.6 GB|
---
layout: model
title: Legal Conduct of business Clause Binary Classifier
author: John Snow Labs
name: legclf_conduct_of_business_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `conduct-of-business` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `conduct-of-business`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_conduct_of_business_clause_en_1.0.0_3.2_1660122262718.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_conduct_of_business_clause_en_1.0.0_3.2_1660122262718.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
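A minimal usage sketch, following the pattern of similar legal clause classifier cards. The classifier consumes sentence embeddings (input label `sentence_embeddings`); the specific embeddings model shown here is an assumption, so check the Models Hub entry for the one this classifier was trained with:

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence embeddings feeding the classifier (assumed embeddings model)
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# Binary clause classifier: outputs `conduct-of-business` or `other`
doc_classifier = ClassifierDLModel.pretrained("legclf_conduct_of_business_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

data = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```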
## Results
```bash
+---------------------+
|result               |
+---------------------+
|[conduct-of-business]|
|[other]              |
|[other]              |
|[conduct-of-business]|
+---------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_conduct_of_business_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
conduct-of-business 0.96 0.75 0.84 32
other 0.92 0.99 0.96 98
accuracy - - 0.93 130
macro-avg 0.94 0.87 0.90 130
weighted-avg 0.93 0.93 0.93 130
```
---
layout: model
title: Legal Subsidiaries Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_subsidiaries_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, subsidiaries, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Subsidiaries` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Subsidiaries`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_subsidiaries_bert_en_1.0.0_3.0_1678050508907.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_subsidiaries_bert_en_1.0.0_3.0_1678050508907.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
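A minimal usage sketch in the `johnsnowlabs` library style used by other Legal NLP cards. The classifier consumes sentence embeddings (input label `sentence_embeddings`); the specific embeddings model shown is an assumption, so check the Models Hub entry for the one this classifier was trained with:

```python
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence embeddings feeding the classifier (assumed embeddings model)
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# Binary clause classifier: outputs `Subsidiaries` or `Other`
doc_classifier = legal.ClassifierDLModel.pretrained("legclf_subsidiaries_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

data = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```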
## Results
```bash
+--------------+
|result        |
+--------------+
|[Subsidiaries]|
|[Other]       |
|[Other]       |
|[Subsidiaries]|
+--------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_subsidiaries_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 0.98 0.96 0.97 85
Subsidiaries 0.95 0.97 0.96 64
accuracy - - 0.97 149
macro-avg 0.97 0.97 0.97 149
weighted-avg 0.97 0.97 0.97 149
```
---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_ff2000
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-ff2000` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_ff2000_en_4.3.0_3.0_1675123479854.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_ff2000_en_4.3.0_3.0_1675123479854.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_tiny_ff2000","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_tiny_ff2000","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_ff2000|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|46.1 MB|
## References
- https://huggingface.co/google/t5-efficient-tiny-ff2000
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Spanish BERT Sentence Base Cased Embedding
author: John Snow Labs
name: sent_bert_base_cased
date: 2021-09-06
tags: [spanish, open_source, bert_sentence_embeddings, cased, es]
task: Embeddings
language: es
edition: Spark NLP 3.2.2
spark_version: 3.0
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
BETO is a BERT model trained on a large Spanish corpus. It is similar in size to BERT-Base and was trained with the Whole Word Masking technique. The original release provides TensorFlow and PyTorch checkpoints for the uncased and cased versions, as well as results on Spanish benchmarks comparing BETO with Multilingual BERT and other (non-BERT-based) models.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_base_cased_es_3.2.2_3.0_1630926259701.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_base_cased_es_3.2.2_3.0_1630926259701.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "es") \
.setInputCols("sentence") \
.setOutputCol("bert_sentence")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings ])
```
```scala
val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "es")
.setInputCols("sentence")
.setOutputCol("bert_sentence")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings ))
```
{:.nlu-block}
```python
import nlu
nlu.load("es.embed_sentence.bert.base_cased").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_bert_base_cased|
|Compatibility:|Spark NLP 3.2.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[bert_sentence]|
|Language:|es|
|Case sensitive:|true|
## Data Source
The model is imported from: https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased
---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_8
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-128-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_8_en_4.0.0_3.0_1654191434774.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_8_en_4.0.0_3.0_1654191434774.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_8","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_8","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.span_bert.base_cased_128d_seed_8").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_8|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|381.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-128-finetuned-squad-seed-8
---
layout: model
title: Korean DistilBertForQuestionAnswering Cased model (from pakupoko)
author: John Snow Labs
name: distilbert_qa_bizlin_model
date: 2023-01-03
tags: [ko, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: ko
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bizlin-distil-model` is a Korean model originally trained by `pakupoko`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_bizlin_model_ko_4.3.0_3.0_1672765867213.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_bizlin_model_ko_4.3.0_3.0_1672765867213.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bizlin_model","ko")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bizlin_model","ko")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_bizlin_model|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ko|
|Size:|104.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/pakupoko/bizlin-distil-model
---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_nh32
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nh32` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nh32_en_4.3.0_3.0_1675123694521.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nh32_en_4.3.0_3.0_1675123694521.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_tiny_nh32","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_tiny_nh32","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_nh32|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|88.2 MB|
## References
- https://huggingface.co/google/t5-efficient-tiny-nh32
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Typo Detector Pipeline for English
author: ahmedlone127
name: distilbert_token_classifier_typo_detector_pipeline
date: 2022-06-14
tags: [ner, bert, bert_for_token, typo, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: false
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of [distilbert_token_classifier_typo_detector](https://nlp.johnsnowlabs.com/2022/01/19/distilbert_token_classifier_typo_detector_en.html).
## Predicted Entities
`PO`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/TYPO_DETECTOR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/DistilBertForTokenClassification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/distilbert_token_classifier_typo_detector_pipeline_en_4.0.0_3.0_1655212406234.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/distilbert_token_classifier_typo_detector_pipeline_en_4.0.0_3.0_1655212406234.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
typo_pipeline = PretrainedPipeline("distilbert_token_classifier_typo_detector_pipeline", lang = "en")
typo_pipeline.annotate("He had also stgruggled with addiction during his tine in Congress.")
```
```scala
val typo_pipeline = new PretrainedPipeline("distilbert_token_classifier_typo_detector_pipeline", lang = "en")
typo_pipeline.annotate("He had also stgruggled with addiction during his tine in Congress.")
```
## Results
```bash
+----------+---------+
|chunk |ner_label|
+----------+---------+
|stgruggled|PO |
|tine |PO |
+----------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_typo_detector_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Community|
|Language:|en|
|Size:|244.2 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- DistilBertForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: Legal Employment Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_employment_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, employment, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset is aimed at contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Employment` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline would make the model see only sentences, not the whole text, so it is better to skip it unless you want to do Binary Classification at sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Employment`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_employment_bert_en_1.0.0_3.0_1678050529088.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_employment_bert_en_1.0.0_3.0_1678050529088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
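This card is missing its usage snippet. Following the conventions of the sibling legal classifier cards, a minimal sketch is shown below; the model name and the input (`sentence_embeddings`) and output (`class`) columns come from the Model Information table, while the generic assembler/embeddings stages and the `sent_bert_base_cased` embeddings name are assumptions and may differ from the pipeline the model was trained with.

```python
# Minimal sketch (assumed upstream stages; requires a licensed Legal NLP installation)
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence-level BERT embeddings feeding the classifier (embeddings model name is an assumption)
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_employment_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

nlpPipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
model = nlpPipeline.fit(df)
result = model.transform(df)
```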
## Results
```bash
+------------+
|result      |
+------------+
|[Employment]|
|[Other]     |
|[Other]     |
|[Employment]|
+------------+
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_employment_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Employment 0.96 0.96 0.96 27
Other 0.98 0.98 0.98 49
accuracy - - 0.97 76
macro-avg 0.97 0.97 0.97 76
weighted-avg 0.97 0.97 0.97 76
```
---
layout: model
title: English BertForQuestionAnswering Cased model (from ponmari)
author: John Snow Labs
name: bert_qa_questionansweing
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `QuestionAnsweingBert` is an English model originally trained by `ponmari`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_questionansweing_en_4.0.0_3.0_1657182420944.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_questionansweing_en_4.0.0_3.0_1657182420944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_questionansweing","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_questionansweing","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq("What is my name?", "My name is Clara and I live in Berkeley.").toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_questionansweing|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ponmari/QuestionAnsweingBert
---
layout: model
title: English DistilBertForQuestionAnswering model (from Rocketknight1)
author: John Snow Labs
name: distilbert_qa_Rocketknight1_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Rocketknight1`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Rocketknight1_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724414659.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Rocketknight1_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724414659.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Rocketknight1_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Rocketknight1_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq("What is my name?", "My name is Clara and I live in Berkeley.").toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Rocketknight1").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_Rocketknight1_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Rocketknight1/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_ff3000
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-ff3000` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_ff3000_en_4.3.0_3.0_1675123510286.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_ff3000_en_4.3.0_3.0_1675123510286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_tiny_ff3000","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_tiny_ff3000","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_ff3000|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|62.1 MB|
## References
- https://huggingface.co/google/t5-efficient-tiny-ff3000
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_el16_dl8
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el16-dl8` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el16_dl8_en_4.3.0_3.0_1675119813436.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el16_dl8_en_4.3.0_3.0_1675119813436.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_small_el16_dl8","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_small_el16_dl8","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_el16_dl8|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|224.3 MB|
## References
- https://huggingface.co/google/t5-efficient-small-el16-dl8
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English asr_wav2vec2_large_english TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: asr_wav2vec2_large_english
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_english` is an English model originally trained by jonatasgrosman.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_english_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_english_en_4.2.0_3.0_1664020258828.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_english_en_4.2.0_3.0_1664020258828.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_english", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_english", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_english|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Pipeline to Detect Living Species
author: John Snow Labs
name: bert_token_classifier_ner_living_species_pipeline
date: 2023-03-20
tags: [pt, ner, clinical, licensed, bertfortokenclassification]
task: Named Entity Recognition
language: pt
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [bert_token_classifier_ner_living_species](https://nlp.johnsnowlabs.com/2022/06/27/bert_token_classifier_ner_living_species_pt_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_pipeline_pt_4.3.0_3.2_1679304320046.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_pipeline_pt_4.3.0_3.2_1679304320046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_ner_living_species_pipeline", "pt", "clinical/models")
text = '''Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_ner_living_species_pipeline", "pt", "clinical/models")
val text = "Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:--------------------|--------:|------:|:------------|-------------:|
| 0 | rapariga | 4 | 11 | HUMAN | 0.999888 |
| 1 | pessoal | 41 | 47 | HUMAN | 0.99987 |
| 2 | paciente | 182 | 189 | HUMAN | 0.999731 |
| 3 | gato | 368 | 371 | SPECIES | 0.999365 |
| 4 | veterinário | 413 | 423 | HUMAN | 0.982236 |
| 5 | Trichophyton rubrum | 632 | 650 | SPECIES | 0.996602 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_living_species_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|pt|
|Size:|666.0 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverterInternalModel
---
layout: model
title: Fast Neural Machine Translation Model from English to Romanian
author: John Snow Labs
name: opus_mt_en_ro
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, ro, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `ro`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ro_xx_2.7.0_2.4_1609168722172.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ro_xx_2.7.0_2.4_1609168722172.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_ro", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_ro", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ro').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ro|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_english_colab TFWav2Vec2ForCTC from shacharm
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_english_colab
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_english_colab` is an English pipeline originally trained by shacharm.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_english_colab_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_english_colab_en_4.2.0_3.0_1664103544762.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_english_colab_en_4.2.0_3.0_1664103544762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_english_colab', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_english_colab", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_english_colab|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Sentence Embeddings - sbert medium (tuned)
author: John Snow Labs
name: sbert_jsl_medium_umls_uncased
date: 2021-06-30
tags: [embeddings, clinical, licensed, en]
task: Embeddings
language: en
nav_key: models
edition: Healthcare NLP 3.1.0
spark_version: 2.4
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained to generate contextual embeddings for input sentences.
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_umls_uncased_en_3.1.0_2.4_1625050119656.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_umls_uncased_en_3.1.0_2.4_1625050119656.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_umls_uncased", "en", "clinical/models")\
    .setInputCols(["sentence"])\
    .setOutputCol("sbert_embeddings")
```
```scala
val sbiobert_embeddings = BertSentenceEmbeddings
.pretrained("sbert_jsl_medium_umls_uncased","en","clinical/models")
.setInputCols(Array("sentence"))
.setOutputCol("sbert_embeddings")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed_sentence.bert.jsl_medium_umls_uncased").predict("""Put your text here.""")
```
## Results
```bash
Gives a 768-dimensional vector representation of the sentence.
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbert_jsl_medium_umls_uncased|
|Compatibility:|Healthcare NLP 3.1.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Case sensitive:|false|
## Data Source
Tuned on the MedNLI and UMLS datasets.
## Benchmarking
```bash
MedNLI Acc: 0.744, STS (cos): 0.695
```
---
layout: model
title: Detect Persons, Locations, Organizations and Misc Entities in Russian (WikiNER 6B 100)
author: John Snow Labs
name: wikiner_6B_100
date: 2020-03-16
task: Named Entity Recognition
language: ru
edition: Spark NLP 2.4.4
spark_version: 2.4
tags: [ner, ru, open_source]
supported: true
annotator: NerDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 6B 100 is trained with GloVe 6B 100 word embeddings, so be sure to use the same embeddings in the pipeline.
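To make the "semantically similar words are closer together" idea concrete, here is a small toy sketch in plain Python. The vectors below are made up for illustration; real GloVe 6B 100 vectors have 100 dimensions and are learned from corpus statistics.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (illustrative only, not actual GloVe values).
embeddings = {
    "king":  [0.9, 0.8, 0.1, 0.2],
    "queen": [0.8, 0.9, 0.1, 0.3],
    "table": [0.1, 0.0, 0.9, 0.8],
}

# Semantically related words end up with a much higher similarity.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1.0
print(cosine_similarity(embeddings["king"], embeddings["table"]))  # much smaller
```

This geometric closeness is what lets the NER model generalize from words seen in training to unseen but similar words, and it is also why the pipeline must use the exact embeddings the model was trained with.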
{:.h2_title}
## Predicted Entities
Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_RU){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_RU.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_100_ru_2.4.4_2.4_1584014001452.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_100_ru_2.4.4_2.4_1584014001452.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
ner_model = NerDLModel.pretrained("wikiner_6B_100", "ru") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['Уильям Генри Гейтс III (родился 28 октября 1955 года) - американский бизнес-магнат, разработчик программного обеспечения, инвестор и филантроп. Он наиболее известен как соучредитель корпорации Microsoft. За время своей карьеры в Microsoft Гейтс занимал должности председателя, главного исполнительного директора (CEO), президента и главного архитектора программного обеспечения, а также был крупнейшим индивидуальным акционером до мая 2014 года. Он является одним из самых известных предпринимателей и пионеров микрокомпьютерная революция 1970-х и 1980-х годов. Гейтс родился и вырос в Сиэтле, штат Вашингтон, в 1975 году вместе с другом детства Полом Алленом в Альбукерке, штат Нью-Мексико, и основал компанию Microsoft. она стала крупнейшей в мире компанией-разработчиком программного обеспечения для персональных компьютеров. Гейтс руководил компанией в качестве председателя и генерального директора, пока в январе 2000 года не ушел с поста генерального директора, но остался председателем и стал главным архитектором программного обеспечения. В конце 1990-х Гейтс подвергся критике за свою деловую тактику, которая считалась антиконкурентной. Это мнение было подтверждено многочисленными судебными решениями. В июне 2006 года Гейтс объявил, что перейдет на неполный рабочий день в Microsoft и будет работать на полную ставку в Фонде Билла и Мелинды Гейтс, частном благотворительном фонде, который он и его жена Мелинда Гейтс создали в 2000 году. [ 9] Постепенно он передал свои обязанности Рэю Оззи и Крейгу Манди. Он ушел с поста президента Microsoft в феврале 2014 года и занял новую должность консультанта по технологиям для поддержки вновь назначенного генерального директора Сатья Наделла.']], ["text"]))
```
```scala
...
val embeddings = WordEmbeddingsModel.pretrained("glove_100d")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("wikiner_6B_100", "ru")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("Уильям Генри Гейтс III (родился 28 октября 1955 года) - американский бизнес-магнат, разработчик программного обеспечения, инвестор и филантроп. Он наиболее известен как соучредитель корпорации Microsoft. За время своей карьеры в Microsoft Гейтс занимал должности председателя, главного исполнительного директора (CEO), президента и главного архитектора программного обеспечения, а также был крупнейшим индивидуальным акционером до мая 2014 года. Он является одним из самых известных предпринимателей и пионеров микрокомпьютерная революция 1970-х и 1980-х годов. Гейтс родился и вырос в Сиэтле, штат Вашингтон, в 1975 году вместе с другом детства Полом Алленом в Альбукерке, штат Нью-Мексико, и основал компанию Microsoft. она стала крупнейшей в мире компанией-разработчиком программного обеспечения для персональных компьютеров. Гейтс руководил компанией в качестве председателя и генерального директора, пока в январе 2000 года не ушел с поста генерального директора, но остался председателем и стал главным архитектором программного обеспечения. В конце 1990-х Гейтс подвергся критике за свою деловую тактику, которая считалась антиконкурентной. Это мнение было подтверждено многочисленными судебными решениями. В июне 2006 года Гейтс объявил, что перейдет на неполный рабочий день в Microsoft и будет работать на полную ставку в Фонде Билла и Мелинды Гейтс, частном благотворительном фонде, который он и его жена Мелинда Гейтс создали в 2000 году. [ 9] Постепенно он передал свои обязанности Рэю Оззи и Крейгу Манди. Он ушел с поста президента Microsoft в феврале 2014 года и занял новую должность консультанта по технологиям для поддержки вновь назначенного генерального директора Сатья Наделла.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Уильям Генри Гейтс III (родился 28 октября 1955 года) - американский бизнес-магнат, разработчик программного обеспечения, инвестор и филантроп. Он наиболее известен как соучредитель корпорации Microsoft. За время своей карьеры в Microsoft Гейтс занимал должности председателя, главного исполнительного директора (CEO), президента и главного архитектора программного обеспечения, а также был крупнейшим индивидуальным акционером до мая 2014 года. Он является одним из самых известных предпринимателей и пионеров микрокомпьютерная революция 1970-х и 1980-х годов. Гейтс родился и вырос в Сиэтле, штат Вашингтон, в 1975 году вместе с другом детства Полом Алленом в Альбукерке, штат Нью-Мексико, и основал компанию Microsoft. она стала крупнейшей в мире компанией-разработчиком программного обеспечения для персональных компьютеров. Гейтс руководил компанией в качестве председателя и генерального директора, пока в январе 2000 года не ушел с поста генерального директора, но остался председателем и стал главным архитектором программного обеспечения. В конце 1990-х Гейтс подвергся критике за свою деловую тактику, которая считалась антиконкурентной. Это мнение было подтверждено многочисленными судебными решениями. В июне 2006 года Гейтс объявил, что перейдет на неполный рабочий день в Microsoft и будет работать на полную ставку в Фонде Билла и Мелинды Гейтс, частном благотворительном фонде, который он и его жена Мелинда Гейтс создали в 2000 году. [ 9] Постепенно он передал свои обязанности Рэю Оззи и Крейгу Манди. Он ушел с поста президента Microsoft в феврале 2014 года и занял новую должность консультанта по технологиям для поддержки вновь назначенного генерального директора Сатья Наделла."""]
ner_df = nlu.load('ru.ner.wikiner.glove.6B_100').predict(text, output_level = "chunk")
ner_df[["entities", "entities_confidence"]]
```
{:.h2_title}
## Results
```bash
+----------------------+---------+
|chunk |ner_label|
+----------------------+---------+
|Уильям Генри Гейтс III|PER |
|Microsoft |ORG |
|За |ORG |
|Microsoft Гейтс |MISC |
|CEO |ORG |
|Он |PER |
|Гейтс |PER |
|Сиэтле |LOC |
|Вашингтон |LOC |
|Полом Алленом |PER |
|Альбукерке |LOC |
|Нью-Мексико |LOC |
|Microsoft |ORG |
|Гейтс |PER |
|Гейтс |PER |
|Это |PER |
|В июне 2006 |MISC |
|Гейтс |PER |
|Microsoft |ORG |
|Фонде Билла |ORG |
+----------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|wikiner_6B_100|
|Type:|ner|
|Compatibility:|Spark NLP 2.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ru|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model was trained on data from [https://ru.wikipedia.org](https://ru.wikipedia.org)
---
layout: model
title: English BertForQuestionAnswering model (from KevinChoi)
author: John Snow Labs
name: bert_qa_KevinChoi_bert_finetuned_squad_accelerate
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-accelerate` is an English model originally trained by `KevinChoi`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_KevinChoi_bert_finetuned_squad_accelerate_en_4.0.0_3.0_1654535806391.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_KevinChoi_bert_finetuned_squad_accelerate_en_4.0.0_3.0_1654535806391.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_KevinChoi_bert_finetuned_squad_accelerate","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_KevinChoi_bert_finetuned_squad_accelerate","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.accelerate.by_KevinChoi").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_KevinChoi_bert_finetuned_squad_accelerate|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/KevinChoi/bert-finetuned-squad-accelerate
---
layout: model
title: Explain Document ML Pipeline for English
author: John Snow Labs
name: explain_document_ml
date: 2021-03-23
tags: [open_source, english, explain_document_ml, pipeline, en]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: en
nav_key: models
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The explain_document_ml is a pretrained pipeline that performs basic text processing steps and recognizes entities.
It covers most of the common text processing tasks on your dataframe.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_ml_en_3.0.0_3.0_1616473253101.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_ml_en_3.0.0_3.0_1616473253101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('explain_document_ml', lang = 'en')
annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_ml", lang = "en")
val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs ! "]
result_df = nlu.load('en.explain').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | spell | lemmas | stems | pos |
|---:|:---------------------------------|:---------------------------------|:-------------------------------------------------|:------------------------------------------------|:------------------------------------------------|:-----------------------------------------------|:---------------------------------------|
| 0 | ['Hello fronm John Snwow Labs!'] | ['Hello fronm John Snwow Labs!'] | ['Hello', 'fronm', 'John', 'Snwow', 'Labs', '!'] | ['Hello', 'front', 'John', 'Snow', 'Labs', '!'] | ['Hello', 'front', 'John', 'Snow', 'Labs', '!'] | ['hello', 'front', 'john', 'snow', 'lab', '!'] | ['UH', 'NN', 'NNP', 'NNP', 'NNP', '.'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_document_ml|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
---
layout: model
title: English asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4 TFWav2Vec2ForCTC from chrisvinsen
author: John Snow Labs
name: asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4` is an English model originally trained by chrisvinsen.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4_en_4.2.0_3.0_1664103661065.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4_en_4.2.0_3.0_1664103661065.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Legal Salary Clause Binary Classifier
author: John Snow Labs
name: legclf_salary_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `salary` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.
If you have big legal documents, and you want to look for clauses, we recommend you to split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
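As a minimal illustration of the paragraph-splitting idea above (plain Python, independent of the Spark NLP splitters shown in the tutorial), splitting a document on blank lines looks like this; the sample contract text is made up for demonstration:

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Split a document into paragraphs on one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical contract excerpt used only to demonstrate the split.
contract = """SECTION 1. SALARY.
The Executive shall receive an annual base salary of $200,000.

SECTION 2. TERM.
This Agreement shall remain in effect for two years."""

paragraphs = split_paragraphs(contract)
print(len(paragraphs))  # 2
```

Each resulting paragraph can then be classified independently, keeping every piece under the 512-token limit of the embeddings.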
## Predicted Entities
`other`, `salary`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_salary_clause_en_1.0.0_3.2_1660123961758.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_salary_clause_en_1.0.0_3.2_1660123961758.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
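The usage snippet is missing from this card; the following is a minimal sketch in the style of the other classifier cards. The `sent_bert_base_cased` sentence-embeddings stage is an assumption (the card only states that the classifier takes `sentence_embeddings` as input), so verify which embeddings this model was trained with before using it. A licensed Legal NLP installation and a running Spark session are required.

```python
# Minimal sketch; the embeddings model name below is an assumption --
# confirm the correct sentence embeddings for this classifier.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

classifier = ClassifierDLModel.pretrained("legclf_salary_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, classifier])

data = spark.createDataFrame([["The Executive shall receive an annual base salary of $200,000."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```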
## Results
```bash
+--------+
|  result|
+--------+
|[salary]|
| [other]|
| [other]|
|[salary]|
+--------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_salary_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 1.00 1.00 1.00 79
salary 1.00 1.00 1.00 33
accuracy - - 1.00 112
macro-avg 1.00 1.00 1.00 112
weighted-avg 1.00 1.00 1.00 112
```
---
layout: model
title: Fast Neural Machine Translation Model from English to Lingala
author: John Snow Labs
name: opus_mt_en_ln
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, ln, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `en`
- target languages: `ln`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ln_xx_2.7.0_2.4_1609170351544.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ln_xx_2.7.0_2.4_1609170351544.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_ln", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_ln", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ln').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ln|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English asr_wav2vec2_base_100h_with_lm_turkish TFWav2Vec2ForCTC from gorkemgoknar
author: John Snow Labs
name: asr_wav2vec2_base_100h_with_lm_turkish
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_with_lm_turkish` is an English model originally trained by gorkemgoknar.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_100h_with_lm_turkish_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_with_lm_turkish_en_4.2.0_3.0_1664038528506.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_with_lm_turkish_en_4.2.0_3.0_1664038528506.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_100h_with_lm_turkish", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_100h_with_lm_turkish", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_100h_with_lm_turkish|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|354.3 MB|
---
layout: model
title: Detect Diseases in Medical Text
author: John Snow Labs
name: bert_token_classifier_ner_bc5cdr_disease
date: 2022-07-25
tags: [en, ner, clinical, licensed, bertfortokenclassification]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalBertForTokenClassifier
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Chemicals, diseases, and their relations are among the most searched topics by PubMed users worldwide, as they play central roles in many areas of biomedical research and healthcare, such as drug discovery and safety surveillance.
This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. The model detects diseases in medical text.
## Predicted Entities
`DISEASE`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc5cdr_disease_en_4.0.0_3.0_1658754395259.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc5cdr_disease_en_4.0.0_3.0_1658754395259.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc5cdr_disease", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("ner")\
.setCaseSensitive(True)\
.setMaxSentenceLength(512)
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
ner_model,
ner_converter
])
data = spark.createDataFrame([["""Indomethacin resulted in histopathologic findings typical of interstitial cystitis, such as leaky bladder epithelium and mucosal mastocytosis. The true incidence of nonsteroidal anti-inflammatory drug-induced cystitis in humans must be clarified by prospective clinical trials. An open-label phase II study of low-dose thalidomide in androgen-independent prostate cancer."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc5cdr_disease", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
ner_model,
ner_converter))
val data = Seq("""Indomethacin resulted in histopathologic findings typical of interstitial cystitis, such as leaky bladder epithelium and mucosal mastocytosis. The true incidence of nonsteroidal anti-inflammatory drug-induced cystitis in humans must be clarified by prospective clinical trials. An open-label phase II study of low-dose thalidomide in androgen-independent prostate cancer.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.bc5cdr_disease").predict("""Indomethacin resulted in histopathologic findings typical of interstitial cystitis, such as leaky bladder epithelium and mucosal mastocytosis. The true incidence of nonsteroidal anti-inflammatory drug-induced cystitis in humans must be clarified by prospective clinical trials. An open-label phase II study of low-dose thalidomide in androgen-independent prostate cancer.""")
```
## Results
```bash
+---------------------+-------+
|ner_chunk |label |
+---------------------+-------+
|interstitial cystitis|DISEASE|
|mastocytosis |DISEASE|
|cystitis |DISEASE|
|prostate cancer |DISEASE|
+---------------------+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_bc5cdr_disease|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
[https://github.com/cambridgeltl/MTL-Bioinformatics-2016](https://github.com/cambridgeltl/MTL-Bioinformatics-2016)
## Benchmarking
```bash
label precision recall f1-score support
B-DISEASE 0.7905 0.9146 0.8480 4424
I-DISEASE 0.6521 0.8725 0.7464 2737
micro-avg 0.7328 0.8985 0.8072 7161
macro-avg 0.7213 0.8935 0.7972 7161
weighted-avg 0.7376 0.8985 0.8092 7161
```
---
layout: model
title: Urdu DistilBERT Embeddings (from Geotrend)
author: John Snow Labs
name: distilbert_embeddings_distilbert_base_ur_cased
date: 2022-04-12
tags: [distilbert, embeddings, ur, open_source]
task: Embeddings
language: ur
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-ur-cased` is an Urdu model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_ur_cased_ur_3.4.2_3.0_1649783731492.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_ur_cased_ur_3.4.2_3.0_1649783731492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_ur_cased","ur") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["مجھے سپارک این ایل پی سے محبت ہے"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_ur_cased","ur")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("مجھے سپارک این ایل پی سے محبت ہے").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ur.embed.distilbert_base_cased").predict("""مجھے سپارک این ایل پی سے محبت ہے""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_embeddings_distilbert_base_ur_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ur|
|Size:|186.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/distilbert-base-ur-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: English BertForQuestionAnswering model (from ixa-ehu)
author: John Snow Labs
name: bert_qa_SciBERT_SQuAD_QuAC
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `SciBERT-SQuAD-QuAC` is an English model originally trained by `ixa-ehu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_SciBERT_SQuAD_QuAC_en_4.0.0_3.0_1654179044906.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_SciBERT_SQuAD_QuAC_en_4.0.0_3.0_1654179044906.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_SciBERT_SQuAD_QuAC","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_SciBERT_SQuAD_QuAC","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.scibert.by_ixa-ehu").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_SciBERT_SQuAD_QuAC|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|410.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ixa-ehu/SciBERT-SQuAD-QuAC
- https://www.aclweb.org/anthology/P18-2124/
- https://arxiv.org/abs/1808.07036
---
layout: model
title: German asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886 TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886` is a German model originally trained by jonatasgrosman.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886_de_4.2.0_3.0_1664115914159.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886_de_4.2.0_3.0_1664115914159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886", "de")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886", "de")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|de|
|Size:|1.2 GB|
---
layout: model
title: Named Entity Recognition in Romanian Official Documents (Medium)
author: John Snow Labs
name: legner_romanian_official_md
date: 2022-11-10
tags: [ro, ner, legal, licensed]
task: Named Entity Recognition
language: ro
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is the medium version of a NER model that extracts PER (Person), LOC (Location), ORG (Organization), DATE, and LEGAL entities from Romanian official documents. Unlike the small version, it labels all entities related to the legal domain as LEGAL.
## Predicted Entities
`PER`, `LOC`, `ORG`, `DATE`, `LEGAL`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/LEGNER_ROMANIAN_OFFICIAL/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_romanian_official_md_ro_1.0.0_3.0_1668083301892.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_romanian_official_md_ro_1.0.0_3.0_1668083301892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_base_cased", "ro")\
.setInputCols("sentence", "token")\
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)\
.setCaseSensitive(True)
ner_model = legal.NerModel.pretrained("legner_romanian_official_md", "ro", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter
])
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
data = spark.createDataFrame([["""Anexa nr. 1 la Ordinul ministrului sănătății nr. 1.468 / 2018 pentru aprobarea prețurilor maximale ale medicamentelor de uz uman, valabile în România, care pot fi utilizate / comercializate de către deținătorii de autorizație de punere pe piață a medicamentelor sau reprezentanții acestora, distribuitorii angro și furnizorii de servicii medicale și medicamente pentru acele medicamente care fac obiectul unei relații contractuale cu Ministerul Sănătății, casele de asigurări de sănătate și / sau direcțiile de sănătate publică județene și a municipiului București, cuprinse în Catalogul național al prețurilor medicamentelor autorizate de punere pe piață în România, a prețurilor de referință generice și a prețurilor de referință inovative, publicat în Monitorul Oficial al României, Partea I nr. 989 și 989 bis din 22 noiembrie 2018, cu modificările și completările ulterioare, se modifică și se completează conform anexei care face parte integrantă din prezentul ordin."""]]).toDF("text")
result = model.transform(data)
```
## Results
```bash
+----------------------------------------------+-----+
|chunk |label|
+----------------------------------------------+-----+
|Ordinul ministrului sănătății nr. 1.468 / 2018|LEGAL|
|România |LOC |
|Ministerul Sănătății |ORG |
|București |LOC |
|România |LOC |
|Monitorul Oficial al României |ORG |
|22 noiembrie 2018 |DATE |
+----------------------------------------------+-----+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_romanian_official_md|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ro|
|Size:|16.5 MB|
## References
Dataset is available [here](https://zenodo.org/record/7025333#.Y2zsquxBx83).
## Benchmarking
```bash
label precision recall f1-score support
DATE 0.84 0.92 0.88 218
LEGAL 0.89 0.96 0.92 337
LOC 0.82 0.77 0.79 158
ORG 0.87 0.88 0.88 463
PER 0.97 0.97 0.97 87
micro-avg 0.87 0.90 0.89 1263
macro-avg 0.88 0.90 0.89 1263
weighted-avg 0.87 0.90 0.89 1263
```
---
layout: model
title: German Electra Embeddings (from deepset)
author: John Snow Labs
name: electra_embeddings_gelectra_large_generator
date: 2022-05-17
tags: [de, open_source, electra, embeddings]
task: Embeddings
language: de
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gelectra-large-generator` is a German model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_gelectra_large_generator_de_3.4.4_3.0_1652786854236.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_gelectra_large_generator_de_3.4.4_3.0_1652786854236.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("electra_embeddings_gelectra_large_generator","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("electra_embeddings_gelectra_large_generator","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ich liebe Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_embeddings_gelectra_large_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|de|
|Size:|194.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/deepset/gelectra-large-generator
- https://arxiv.org/pdf/2010.10906.pdf
- https://deepset.ai/german-bert
- https://deepset.ai/germanquad
- https://github.com/deepset-ai/FARM
- https://github.com/deepset-ai/haystack/
- https://twitter.com/deepset_ai
- https://www.linkedin.com/company/deepset-ai/
- https://haystack.deepset.ai/community/join
- https://github.com/deepset-ai/haystack/discussions
- https://deepset.ai
- http://www.deepset.ai/jobs
---
layout: model
title: Legal Deterioration Of The Environment Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_deterioration_of_the_environment_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, deterioration_of_the_environment, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
The legclf_deterioration_of_the_environment_bert model is a BERT Sentence Embeddings Document Classifier that determines whether a given document belongs to the Deterioration_of_The_Environment class or not (binary classification), according to EuroVoc labels.
## Predicted Entities
`Deterioration_of_The_Environment`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_deterioration_of_the_environment_bert_en_1.0.0_3.0_1678111773583.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_deterioration_of_the_environment_bert_en_1.0.0_3.0_1678111773583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
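This card ships without a usage snippet, so here is a minimal sketch following the standard Legal NLP document-classification pipeline: assemble the document, compute sentence embeddings, and feed them to the classifier. The classifier's input (`sentence_embeddings`) and output (`class`) columns come from the Model Information table below; the upstream embeddings model name `sent_bert_base_uncased` is an assumption based on similar EURLEX classifiers, not confirmed for this model.

```python
# Sketch only: "sent_bert_base_uncased" is an assumed upstream embeddings model.
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_uncased", "en")\
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_deterioration_of_the_environment_bert", "en", "legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("class")

pipeline = nlp.Pipeline(stages=[
document_assembler,
embeddings,
doc_classifier
])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```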
## Results
```bash
+----------------------------------+
|result                            |
+----------------------------------+
|[Deterioration_of_The_Environment]|
|[Other]                           |
|[Other]                           |
|[Deterioration_of_The_Environment]|
+----------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_deterioration_of_the_environment_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.6 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Deterioration_of_The_Environment 0.92 0.91 0.91 196
Other 0.91 0.92 0.91 191
accuracy - - 0.91 387
macro-avg 0.91 0.91 0.91 387
weighted-avg 0.91 0.91 0.91 387
```
---
layout: model
title: English DistilBertForQuestionAnswering model (from pakupoko)
author: John Snow Labs
name: distilbert_qa_bizlin_distil_model
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bizlin-distil-model` is an English model originally trained by `pakupoko`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_bizlin_distil_model_en_4.0.0_3.0_1654723375668.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_bizlin_distil_model_en_4.0.0_3.0_1654723375668.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bizlin_distil_model","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bizlin_distil_model","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.by_pakupoko").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_bizlin_distil_model|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|104.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/pakupoko/bizlin-distil-model
---
layout: model
title: English RobertaForQuestionAnswering (from deepset)
author: John Snow Labs
name: roberta_qa_tinyroberta_squad2
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tinyroberta-squad2` is an English model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_tinyroberta_squad2_en_4.0.0_3.0_1655740021196.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_tinyroberta_squad2_en_4.0.0_3.0_1655740021196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_tinyroberta_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_tinyroberta_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.tiny.by_deepset").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_tinyroberta_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|307.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/deepset/tinyroberta-squad2
- https://www.linkedin.com/company/deepset-ai/
- https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
- https://arxiv.org/pdf/1909.10351.pdf
- https://haystack.deepset.ai/community/join
- https://github.com/deepset-ai/haystack
- http://deepset.ai/
- https://haystack.deepset.ai/
- http://www.deepset.ai/jobs
- https://twitter.com/deepset_ai
- https://github.com/deepset-ai/haystack/discussions
- https://github.com/deepset-ai/haystack/
- https://deepset.ai
- https://deepset.ai/germanquad
- https://haystack.deepset.ai
- https://deepset.ai/german-bert
---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_6
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_6_en_4.0.0_3.0_1654537973401.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_6_en_4.0.0_3.0_1654537973401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_6","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_6","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.span_bert.base_cased_1024d_seed_6").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_6|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|390.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-6
---
layout: model
title: English DistilBertForQuestionAnswering model (from adamlin)
author: John Snow Labs
name: distilbert_qa_base_cased_sgd_qa_step5000
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-sgd_qa-step5000` is an English model originally trained by `adamlin`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_sgd_qa_step5000_en_4.0.0_3.0_1654723671592.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_sgd_qa_step5000_en_4.0.0_3.0_1654723671592.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_sgd_qa_step5000","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_sgd_qa_step5000","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.base_cased.by_adamlin").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_cased_sgd_qa_step5000|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/adamlin/distilbert-base-cased-sgd_qa-step5000
---
layout: model
title: Sentiment Analysis for Urdu (IMDB Review dataset)
author: John Snow Labs
name: sentimentdl_urduvec_imdb
date: 2021-01-09
task: Sentiment Analysis
language: ur
edition: Spark NLP 2.7.1
spark_version: 2.4
tags: [ur, open_source, sentiment]
supported: true
annotator: SentimentDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Analyse sentiment in reviews by classifying them as `positive` or `negative`. This model is trained using `urduvec_140M_300d` word embeddings. The word embeddings are converted to sentence embeddings before being fed to the sentiment classifier, which uses a DL architecture to classify sentences.
## Predicted Entities
`positive` , `negative`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentimentdl_urduvec_imdb_ur_2.7.1_2.4_1610185467237.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentimentdl_urduvec_imdb_ur_2.7.1_2.4_1610185467237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, SentenceEmbeddings, SentimentDLModel.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel()\
.pretrained('urduvec_140M_300d', 'ur')\
.setInputCols(["sentence",'token'])\
.setOutputCol("word_embeddings")
sentence_embeddings = SentenceEmbeddings() \
.setInputCols(["sentence", "word_embeddings"]) \
.setOutputCol("sentence_embeddings") \
.setPoolingStrategy("AVERAGE")
classifier = SentimentDLModel.pretrained('sentimentdl_urduvec_imdb', 'ur')\
.setInputCols(['sentence_embeddings']).setOutputCol('sentiment')
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, sentence_embeddings, classifier])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate(["مجھے واقعی یہ شو سند ہے۔ یہی وجہ ہے کہ مجھے حال ہی میں یہ جان کر مایوسی ہوئی ہے کہ جارج لوپیز ایک ",
"بالکل بھی اچھ ،ی کام نہیں کیا گیا ، پوری فلم صرف گرڈج تھی اور کہیں بھی بے ترتیب لوگوں کو ہلاک نہیں"])
```
{:.nlu-block}
```python
import nlu
text = ["مجھے واقعی یہ شو سند ہے۔ یہی وجہ ہے کہ مجھے حال ہی میں یہ جان کر مایوسی ہوئی ہے کہ جارج لوپیز ایک ", "بالکل بھی اچھ ،ی کام نہیں کیا گیا ، پوری فلم صرف گرڈج تھی اور کہیں بھی بے ترتیب لوگوں کو ہلاک نہیں"]
urdusent_df = nlu.load('ur.sentiment').predict(text, output_level='sentence')
urdusent_df
```
## Results
```bash
| | document | sentiment |
|---:|---------------------------------------------------------------------------------------------------------:|--------------:|
| 0 |مجھے واقعی یہ شو سند ہے۔ یہی وجہ ہے کہ مجھے حال ہی میں یہ جان کر مایوسی ہوئی ہے کہ جارج لوپیز ایک | positive |
| 1 |بالکل بھی اچھ ،ی کام نہیں کیا گیا ، پوری فلم صرف گرڈج تھی اور کہیں بھی بے ترتیب لوگوں کو ہلاک نہیں | negative |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sentimentdl_urduvec_imdb|
|Compatibility:|Spark NLP 2.7.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[sentiment]|
|Language:|ur|
|Dependencies:|urduvec_140M_300d|
## Data Source
This model was trained using data from https://www.kaggle.com/akkefa/imdb-dataset-of-50k-movie-translated-urdu-reviews
## Benchmarking
```bash
loss: 2428.622 - acc: 0.8181 - val_acc: 80.0
```
---
layout: model
title: Arabic BertForMaskedLM Base Cased model (from CAMeL-Lab)
author: John Snow Labs
name: bert_embeddings_base_arabic_camel_mix
date: 2022-12-02
tags: [ar, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: ar
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-arabic-camelbert-mix` is an Arabic model originally trained by `CAMeL-Lab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_mix_ar_4.2.4_3.0_1670015990923.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_mix_ar_4.2.4_3.0_1670015990923.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_mix","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_mix","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_arabic_camel_mix|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|409.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-mix
- https://arxiv.org/abs/2103.06678
- https://github.com/CAMeL-Lab/CAMeLBERT
- https://catalog.ldc.upenn.edu/LDC2011T11
- http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus
- https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian
- https://archive.org/details/arwiki-20190201
- https://oscar-corpus.com/
- https://zenodo.org/record/3891466#.YEX4-F0zbzc
- https://github.com/google-research/bert
- https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297
- https://github.com/CAMeL-Lab/camel_tools
---
layout: model
title: Split Sentences in Healthcare Texts
author: John Snow Labs
name: sentence_detector_dl_healthcare
class: DeepSentenceDetector
language: en
nav_key: models
repository: clinical/models
date: 2020-09-13
task: Sentence Detection
edition: Healthcare NLP 2.6.0
spark_version: 2.4
tags: [clinical,sentence_detection,en]
supported: true
annotator: SentenceDetectorDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/SENTENCE_DETECTOR_HC/){:.button.button-orange.button-orange-trans.co.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sentence_detector_dl_healthcare_en_2.6.0_2.4_1600001082565.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sentence_detector_dl_healthcare_en_2.6.0_2.4_1600001082565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel\
.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentences")
sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL]))
sd_model.fullAnnotate("""John loves Mary.Mary loves Peter. Peter loves Helen .Helen loves John; Total: four people involved.""")
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val pipeline = new Pipeline().setStages(Array(documenter, model))
val data = Seq("John loves Mary.Mary loves Peter. Peter loves Helen .Helen loves John; Total: four people involved.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.detect_sentence.clinical").predict("""John loves Mary.Mary loves Peter. Peter loves Helen .Helen loves John; Total: four people involved.""")
```
{:.h2_title}
## Results
```bash
+---+------------------------------+
| 0 | John loves Mary. |
+---+------------------------------+
| 1 | Mary loves Peter. |
+---+------------------------------+
| 2 | Peter loves Helen . |
+---+------------------------------+
| 3 | Helen loves John; |
+---+------------------------------+
| 4 | Total: four people involved. |
+---+------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---------------|-------------------------------------------|
| Name: | sentence_detector_dl_healthcare |
| Type: | DeepSentenceDetector |
| Compatibility: | Spark NLP 2.6.0+ |
| License: | Licensed |
| Edition: | Official |
|Input labels: | [document] |
|Output labels: | sentence |
| Language: | en |
{:.h2_title}
## Data Source
The healthcare SDDL model is trained on domain-specific (healthcare) text, annotated internally, so that it generalizes well to clinical notes.
{:.h2_title}
## Benchmarking
```bash
label Accuracy Recall Prec F1
0 0.98 1.00 0.96 0.98
```
---
layout: model
title: Finnish Lemmatizer
author: John Snow Labs
name: lemma
date: 2020-05-05 12:35:00 +0800
task: Lemmatization
language: fi
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [lemmatizer, fi]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous.
{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_fi_2.5.0_2.4_1588671290521.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_fi_2.5.0_2.4_1588671290521.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
lemmatizer = LemmatizerModel.pretrained("lemma", "fi") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä.")
```
```scala
...
val lemmatizer = LemmatizerModel.pretrained("lemma", "fi")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer))
val data = Seq("Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä."""]
lemma_df = nlu.load('fi.lemma').predict(text, output_level='document')
lemma_df.lemma.values[0]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=2, result='se', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=4, end=10, result='lisäksi', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=11, end=11, result=',', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=13, end=16, result='että', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=18, end=20, result='hän', metadata={'sentence': '0'}, embeddings=[]),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma|
|Type:|lemmatizer|
|Compatibility:|Spark NLP 2.5.0+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[lemma]|
|Language:|fi|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: Google T5 (Text-To-Text Transfer Transformer) Small
author: John Snow Labs
name: t5_small
date: 2021-01-08
task: [Question Answering, Summarization, Translation]
language: en
nav_key: models
edition: Spark NLP 2.7.1
spark_version: 2.4
tags: [open_source, t5, summarization, translation, en, seq2seq]
supported: true
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
T5 is the transformer model described in the seminal paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". It can perform a variety of tasks, such as text summarization, question answering, and translation. More details about using the model can be found in the paper (https://arxiv.org/pdf/1910.10683.pdf).
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/T5TRANSFORMER/){:.button.button-orange}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5TRANSFORMER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_en_2.7.1_2.4_1610133219885.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_en_2.7.1_2.4_1610133219885.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Either set one of the following tasks on the annotator or include it inline with your input:
- summarize:
- translate English to German:
- translate English to French:
- stsb sentence1: Big news. sentence2: No idea.
The full list of tasks is in the Appendix of the paper: https://arxiv.org/pdf/1910.10683.pdf
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("documents")
t5 = T5Transformer() \
.pretrained("t5_small") \
.setTask("summarize:")\
.setMaxOutputLength(200)\
.setInputCols(["documents"]) \
.setOutputCol("summaries")
pipeline = Pipeline().setStages([document_assembler, t5])
results = pipeline.fit(data_df).transform(data_df)
results.select("summaries.result").show(truncate=False)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("documents")
val t5 = T5Transformer
.pretrained("t5_small")
.setTask("summarize:")
.setInputCols(Array("documents"))
.setOutputCol("summaries")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val model = pipeline.fit(dataDf)
val results = model.transform(dataDf)
results.select("summaries.result").show(truncate = false)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.t5.small").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_small|
|Compatibility:|Spark NLP 2.7.1+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[t5]|
|Language:|en|
## Data Source
https://huggingface.co/t5-small
---
layout: model
title: Translate English to North Germanic languages Pipeline
author: John Snow Labs
name: translate_en_gmq
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, gmq, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `gmq`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_gmq_xx_2.7.0_2.4_1609698966439.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_gmq_xx_2.7.0_2.4_1609698966439.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_gmq", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_gmq", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.gmq').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_gmq|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForQuestionAnswering model (from gdario)
author: John Snow Labs
name: bert_qa_biobert_bioasq
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert_bioasq` is an English model originally trained by `gdario`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_bioasq_en_4.0.0_3.0_1654185669067.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_bioasq_en_4.0.0_3.0_1654185669067.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_bioasq","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_biobert_bioasq","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.biobert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_biobert_bioasq|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|403.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/gdario/biobert_bioasq
---
layout: model
title: English asr_wav2vec2_xls_r_300m_kh TFWav2Vec2ForCTC from kongkeaouch
author: John Snow Labs
name: pipeline_asr_wav2vec2_xls_r_300m_kh
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_kh` is an English model originally trained by kongkeaouch.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_kh_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_kh_en_4.2.0_3.0_1664025155355.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_kh_en_4.2.0_3.0_1664025155355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_kh', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_kh", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xls_r_300m_kh|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Legal Pledge Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_pledge_agreement_bert
date: 2022-12-06
tags: [en, legal, classification, agreement, pledge, licensed, bert, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_pledge_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `pledge-agreement` or not (Binary Classification).
Compared to the Longformer model, this model is lighter and offers faster inference.
## Predicted Entities
`pledge-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_pledge_agreement_bert_en_1.0.0_3.0_1670349668991.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_pledge_agreement_bert_en_1.0.0_3.0_1670349668991.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
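A minimal pipeline sketch follows, in the style of the other classifier cards in this collection. The `sent_bert_base_cased` embeddings model and the `legal/models` repository folder are assumptions based on how sibling `legclf_*_bert` document classifiers are typically wired; verify both against the Models Hub before use.

```python
# Assumed stages: the exact embeddings model paired with this classifier is not
# stated on this card; sent_bert_base_cased is a plausible default, not confirmed.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = ClassifierDLModel.pretrained("legclf_pledge_agreement_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```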
## Results
```bash
+------------------+
|result            |
+------------------+
|[pledge-agreement]|
|[other]           |
|[other]           |
|[pledge-agreement]|
+------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_pledge_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house + SEC documents
## Benchmarking
```bash
label precision recall f1-score support
other 0.96 0.98 0.97 65
pledge-agreement 0.96 0.88 0.92 26
accuracy - - 0.96 91
macro-avg 0.96 0.93 0.94 91
weighted-avg 0.96 0.96 0.96 91
```
---
layout: model
title: Chinese Bert Embeddings (Roberta, Whole Word Masking)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_wwm_ext
date: 2022-04-11
tags: [bert, embeddings, zh, open_source]
task: Embeddings
language: zh
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chinese-roberta-wwm-ext` is a Chinese model originally trained by `hfl`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_ext_zh_3.4.2_3.0_1649668840010.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_ext_zh_3.4.2_3.0_1649668840010.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_ext","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_ext","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.embed.chinese_roberta_wwm_ext").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_wwm_ext|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|383.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/hfl/chinese-roberta-wwm-ext
- https://arxiv.org/abs/1906.08101
- https://github.com/google-research/bert
- https://github.com/ymcui/Chinese-BERT-wwm
- https://github.com/ymcui/MacBERT
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/ymcui/HFL-Anthology
- https://arxiv.org/abs/2004.13922
---
layout: model
title: English asr_model_2 TFWav2Vec2ForCTC from niclas
author: John Snow Labs
name: pipeline_asr_model_2
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_model_2` is an English model originally trained by niclas.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_model_2_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_model_2_en_4.2.0_3.0_1664097773728.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_model_2_en_4.2.0_3.0_1664097773728.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_model_2', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_model_2", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_model_2|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from google)
author: John Snow Labs
name: t5_efficient_base_el16
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-el16` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_el16_en_4.3.0_3.0_1675110985082.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_el16_en_4.3.0_3.0_1675110985082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_base_el16","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_base_el16","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_base_el16|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|529.5 MB|
## References
- https://huggingface.co/google/t5-efficient-base-el16
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Legal Financing And Investment Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_financing_and_investment_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, financing_and_investment, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
The `legclf_financing_and_investment_bert` model is a Bert Sentence Embeddings Document Classifier: given a document, it classifies whether the document belongs to the class `Financing_and_Investment` or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`Financing_and_Investment`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_financing_and_investment_bert_en_1.0.0_3.0_1678111700202.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_financing_and_investment_bert_en_1.0.0_3.0_1678111700202.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
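This card omits a usage snippet. Following the pattern of the other Legal NLP classifier cards in this collection, a minimal sketch might look like the following. Note that the `sent_bert_base_cased` embeddings name is an assumption (the card only states that a Bert Sentence Embeddings classifier is used), and a licensed `johnsnowlabs` installation with an active `spark` session is required:

```python
# Sketch only: assumes the licensed johnsnowlabs library (nlp, legal modules)
# and an active Spark session named `spark`.
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumption: the exact sentence-embeddings model is not stated on this card.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_financing_and_investment_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```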
## Results
```bash
+--------------------------+
|result                    |
+--------------------------+
|[Financing_and_Investment]|
|[Other]                   |
|[Other]                   |
|[Financing_and_Investment]|
+--------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_financing_and_investment_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.7 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Financing_and_Investment 0.84 0.91 0.87 45
Other 0.91 0.84 0.87 50
accuracy - - 0.87 95
macro-avg 0.87 0.88 0.87 95
weighted-avg 0.88 0.87 0.87 95
```
---
layout: model
title: Javanese DistilBERT Embeddings (Small, Imdb)
author: John Snow Labs
name: distilbert_embeddings_javanese_distilbert_small_imdb
date: 2022-04-12
tags: [distilbert, embeddings, jv, open_source]
task: Embeddings
language: jv
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `javanese-distilbert-small-imdb` is a Javanese model originally trained by `w11wo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_javanese_distilbert_small_imdb_jv_3.4.2_3.0_1649783783892.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_javanese_distilbert_small_imdb_jv_3.4.2_3.0_1649783783892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_javanese_distilbert_small_imdb","jv") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_javanese_distilbert_small_imdb","jv")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("jv.embed.javanese_distilbert_small_imdb").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_embeddings_javanese_distilbert_small_imdb|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|jv|
|Size:|248.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/w11wo/javanese-distilbert-small-imdb
- https://arxiv.org/abs/1910.01108
- https://github.com/sgugger
- https://w11wo.github.io/
---
layout: model
title: Translate Shona to English Pipeline
author: John Snow Labs
name: translate_sn_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, sn, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended.
- source languages: `sn`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_sn_en_xx_2.7.0_2.4_1609688774813.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_sn_en_xx_2.7.0_2.4_1609688774813.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_sn_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_sn_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.sn.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_sn_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Sentence Entity Resolver for RxNorm According to National Institute of Health (NIH) Database (sbiobert_base_cased_mli embeddings)
author: John Snow Labs
name: sbiobertresolve_rxnorm_nih
date: 2023-02-22
tags: [entity_resolution, rxnorm, clinical, en, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 4.3.0
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes according to the National Institute of Health (NIH) database using `sbiobert_base_cased_mli` Sentence Bert Embeddings.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_nih_en_4.3.0_3.0_1677106956679.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_nih_en_4.3.0_3.0_1677106956679.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(['DRUG'])\
.setPreservePosition(False)
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_nih","en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_model,
ner_converter,
chunk2doc,
sbert_embedder,
rxnorm_resolver])
data = spark.createDataFrame([["""She is given folic acid 1 mg daily , levothyroxine 0.1 mg and aspirin 81 mg daily and metformin 100 mg, coumadin 5 mg."""]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("DRUG"))
.setPreservePosition(false)
val chunk2doc = new Chunk2Doc()
.setInputCols("ner_chunk")
.setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols("ner_chunk_doc")
.setOutputCol("sbert_embeddings")
val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_nih","en", "clinical/models")
.setInputCols("sbert_embeddings")
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter, chunk2doc, sbert_embedder, rxnorm_resolver))
val data = Seq("""She is given folic acid 1 mg daily , levothyroxine 0.1 mg and aspirin 81 mg daily and metformin 100 mg, coumadin 5 mg.""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| | sent_id | ner_chunk | entity | rxnorm_code | all_codes | resolutions |
|---:|----------:|:---------------------|:---------|--------------:|:------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------|
| 0 | 0 | folic acid 1 mg | DRUG | 12281181 | ['12281181', '12283696', '12270292', '12306595', 1227889...| ['folic acid 1 MG [folic acid 1 MG]', 'folic acid 1.1 MG [folic acid 1.1 MG]', 'folic acid 1 MG/ML [folic acid 1 MG/ML]', 'folic a...|
| 1 | 0 | levothyroxine 0.1 mg | DRUG | 12275630 | ['12275630', '12275646', '12301585', '12306484', 1235044...| ['levothyroxine sodium 0.1 MG [levothyroxine sodium 0.1 MG]', 'levothyroxine sodium 0.01 MG [levothyroxine sodium 0.01 MG]', 'levo...|
| 2 | 0 | aspirin 81 mg | DRUG | 12278696 | ['12278696', '12299811', '12298729', '12311168', '1230631...| ['aspirin 81 MG [aspirin 81 MG]', 'aspirin 81 MG [YSP Aspirin] [aspirin 81 MG [YSP Aspirin]]', 'aspirin 81 MG [Med Aspirin] [aspir...|
| 3 | 0 | metformin 100 mg | DRUG | 12282749 | ['12282749', '3735316', '12279966', '1509573', '3736179'... | ['metformin hydrochloride 100 MG/ML [metformin hydrochloride 100 MG/ML]', 'metFORMIN hydrochloride 100 MG/ML [metFORMIN hydrochlor...|
| 4 | 0 | coumadin 5 mg | DRUG | 1768579 | ['1768579', '12534260', '1780903', '1768951', '1510873' ... | ['coumarin 5 MG [coumarin 5 MG]', 'vericiguat 5 MG [vericiguat 5 MG]', 'pridinol 5 MG [pridinol 5 MG]', 'propinox 5 MG [propinox 5...|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_rxnorm_nih|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[rxnorm_code]|
|Language:|en|
|Size:|818.8 MB|
|Case sensitive:|false|
## References
Trained on February 2023 with `sbiobert_base_cased_mli` embeddings.
https://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html
---
layout: model
title: Translate Gun to English Pipeline
author: John Snow Labs
name: translate_guw_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, guw, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended.
- source languages: `guw`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_guw_en_xx_2.7.0_2.4_1609688796603.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_guw_en_xx_2.7.0_2.4_1609688796603.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_guw_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_guw_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.guw.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_guw_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Italian Named Entity Recognition (from gunghio)
author: John Snow Labs
name: xlmroberta_ner_xlm_roberta_base_finetuned_panx_ner
date: 2022-05-17
tags: [xlm_roberta, ner, token_classification, it, open_source]
task: Named Entity Recognition
language: it
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-ner` is an Italian model originally trained by `gunghio`.
## Predicted Entities
`LOC`, `ORG`, `PER`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_panx_ner_it_3.4.2_3.0_1652808069992.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_panx_ner_it_3.4.2_3.0_1652808069992.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_panx_ner","it") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_panx_ner","it")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Adoro Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_panx_ner|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|it|
|Size:|878.3 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/gunghio/xlm-roberta-base-finetuned-panx-ner
---
layout: model
title: Pipeline to Detect Normalized Genes and Human Phenotypes
author: John Snow Labs
name: ner_human_phenotype_gene_clinical_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, gene, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_human_phenotype_gene_clinical](https://nlp.johnsnowlabs.com/2021/03/31/ner_human_phenotype_gene_clinical_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_clinical_pipeline_en_3.4.1_3.0_1647867667569.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_clinical_pipeline_en_3.4.1_3.0_1647867667569.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_human_phenotype_gene_clinical_pipeline", "en", "clinical/models")
pipeline.annotate("Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).")
```
```scala
val pipeline = new PretrainedPipeline("ner_human_phenotype_gene_clinical_pipeline", "en", "clinical/models")
pipeline.annotate("Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.human_phnotype_gene_clinical.pipeline").predict("""Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).""")
```
## Results
```bash
+----+------------------+---------+-------+----------+
| | chunk | begin | end | entity |
+====+==================+=========+=======+==========+
| 0 | BS type | 29 | 32 | GENE |
+----+------------------+---------+-------+----------+
| 1 | polyhydramnios | 75 | 88 | HP |
+----+------------------+---------+-------+----------+
| 2 | polyuria | 91 | 98 | HP |
+----+------------------+---------+-------+----------+
| 3 | nephrocalcinosis | 101 | 116 | HP |
+----+------------------+---------+-------+----------+
| 4 | hypokalemia | 122 | 132 | HP |
+----+------------------+---------+-------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_human_phenotype_gene_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: English RobertaForQuestionAnswering (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739428334.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739428334.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.base_rule_based_hier_quadruplet_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|460.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0
---
layout: model
title: Chinese BertForMaskedLM Cased model (from qinluo)
author: John Snow Labs
name: bert_embeddings_wo_chinese_plus
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `wobert-chinese-plus` is a Chinese model originally trained by `qinluo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_wo_chinese_plus_zh_4.2.4_3.0_1670327320217.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_wo_chinese_plus_zh_4.2.4_3.0_1670327320217.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_wo_chinese_plus","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_wo_chinese_plus","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_wo_chinese_plus|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|467.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/qinluo/wobert-chinese-plus
- https://github.com/ZhuiyiTechnology/WoBERT
- https://github.com/JunnYu/WoBERT_pytorch
---
layout: model
title: Chunk Entity Resolver RxNorm-scdc
author: John Snow Labs
name: chunkresolve_rxnorm_scdc_healthcare
date: 2021-04-16
tags: [entity_resolution, clinical, licensed, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to RxNorm codes using chunk embeddings (augmented with synonyms, four times richer than the previous resolver).
## Predicted Entities
RxNorm codes
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_RXNORM/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_scdc_healthcare_en_3.0.0_3.0_1618605170280.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_scdc_healthcare_en_3.0.0_3.0_1618605170280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
resolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_scdc_healthcare", "en", "clinical/models") \
.setInputCols("token", "chunk_embeddings") \
.setOutputCol("entity")
pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, resolver])
data = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation."]]).toDF("text")
model = pipeline.fit(data)
results = model.transform(data)
...
```
```scala
...
val resolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_scdc_healthcare", "en", "clinical/models")
.setInputCols("token", "chunk_embeddings")
.setOutputCol("entity")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, wordEmbeddings, clinicalNer, nerConverter, chunkEmbeddings, resolver))
val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
| chunk| entity| target_text| code|confidence|
+---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
| metformin|TREATMENT|metFORMIN compounding powder:::Metformin Hydrochloride Powder:::metFORMIN 500 mg oral tablet:::me...| 601021| 0.2364|
| glipizide|TREATMENT|Glipizide Powder:::Glipizide Crystal:::Glipizide Tablets:::glipiZIDE 5 mg oral tablet:::glipiZIDE...| 241604| 0.3647|
|dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG|TREATMENT|Ezetimibe and Atorvastatin Tablets:::Amlodipine and Atorvastatin Tablets:::Atorvastatin Calcium T...|1422084| 0.3407|
| dapagliflozin|TREATMENT|Dapagliflozin Tablets:::dapagliflozin 5 mg oral tablet:::dapagliflozin 10 mg oral tablet:::Dapagl...|1488568| 0.7070|
+---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
```
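The `target_text` column above packs the ranked resolution candidates into a single `:::`-delimited string. A minimal, self-contained sketch (plain Python, independent of Spark NLP) of how such a field can be split into an ordered candidate list; the helper name is illustrative, not part of the library API:

```python
def parse_candidates(target_text: str, top_k: int = 3) -> list:
    """Split a ':::'-delimited resolver candidate string into a ranked list."""
    candidates = [c.strip() for c in target_text.split(":::") if c.strip()]
    return candidates[:top_k]

# A (truncated) candidate string from the results table above.
raw = "Dapagliflozin Tablets:::dapagliflozin 5 mg oral tablet:::dapagliflozin 10 mg oral tablet"
print(parse_candidates(raw, top_k=2))
# → ['Dapagliflozin Tablets', 'dapagliflozin 5 mg oral tablet']
```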
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|chunkresolve_rxnorm_scdc_healthcare|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[token, chunk_embeddings]|
|Output Labels:|[rxnorm_code]|
|Language:|en|
---
layout: model
title: Italian Electra Embeddings (from dbmdz)
author: John Snow Labs
name: electra_embeddings_electra_base_italian_xxl_cased_generator
date: 2022-05-17
tags: [it, open_source, electra, embeddings]
task: Embeddings
language: it
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-italian-xxl-cased-generator` is an Italian model originally trained by `dbmdz`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_italian_xxl_cased_generator_it_3.4.4_3.0_1652786574536.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_italian_xxl_cased_generator_it_3.4.4_3.0_1652786574536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_italian_xxl_cased_generator","it") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_italian_xxl_cased_generator","it")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Adoro Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_embeddings_electra_base_italian_xxl_cased_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|it|
|Size:|128.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/dbmdz/electra-base-italian-xxl-cased-generator
- http://opus.nlpl.eu/
- https://traces1.inria.fr/oscar/
- https://github.com/dbmdz/berts/issues/7
- https://github.com/stefan-it/turkish-bert/tree/master/electra
- https://github.com/stefan-it/italian-bertelectra
- https://github.com/dbmdz/berts/issues/new
---
layout: model
title: Smaller BERT Embeddings (L-12_H-512_A-8)
author: John Snow Labs
name: small_bert_L12_512
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L12_512_en_2.6.0_2.4_1598344865471.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L12_512_en_2.6.0_2.4_1598344865471.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("small_bert_L12_512", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("small_bert_L12_512", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.bert.small_L12_512').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_bert_small_L12_512_embeddings
I [0.5089142322540283, -0.21703988313674927, -0....
love [-0.3273950517177582, 0.9550480842590332, -0.1...
NLP [0.3552919626235962, 0.3629235625267029, 0.891...
```
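Each token above is mapped to a 512-dimensional vector. Downstream tasks typically compare such vectors with cosine similarity; a minimal, self-contained sketch using toy 3-dimensional vectors (stand-ins, not actual model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for 512-dim token embeddings.
v_love = [0.5, -0.2, 0.8]
v_nlp = [0.4, -0.1, 0.9]
print(round(cosine_similarity(v_love, v_nlp), 4))
```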
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|small_bert_L12_512|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|512|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1
---
layout: model
title: Arabic Bert Embeddings (Base, Arabert Model)
author: John Snow Labs
name: bert_embeddings_bert_base_arabert
date: 2022-04-11
tags: [bert, embeddings, ar, open_source]
task: Embeddings
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabert` is an Arabic model originally trained by `aubmindlab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabert_ar_3.4.2_3.0_1649677303708.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabert_ar_3.4.2_3.0_1649677303708.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabert","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabert","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("أنا أحب شرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.embed.bert_base_arabert").predict("""أنا أحب شرارة NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_arabert|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|507.5 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/aubmindlab/bert-base-arabert
- https://github.com/google-research/bert
- https://arxiv.org/abs/2003.00104
- https://github.com/WissamAntoun/pydata_khobar_meetup
- http://alt.qcri.org/farasa/segmenter.html
- https://github.com/google-research/bert/blob/master/multilingual.md
- https://github.com/elnagara/HARD-Arabic-Dataset
- https://www.aclweb.org/anthology/D15-1299
- https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf
- https://github.com/mohamedadaly/LABR
- http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp
- https://github.com/husseinmozannar/SOQAL
- https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md
- https://arxiv.org/abs/2003.00104v2
- https://archive.org/details/arwiki-20190201
- https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4
- https://www.aclweb.org/anthology/W19-4619
- https://sites.aub.edu.lb/mindlab/
- https://www.yakshof.com/#/
- https://www.behance.net/rahalhabib
- https://www.linkedin.com/in/wissam-antoun-622142b4/
- https://twitter.com/wissam_antoun
- https://github.com/WissamAntoun
- https://www.linkedin.com/in/fadybaly/
- https://twitter.com/fadybaly
- https://github.com/fadybaly
---
layout: model
title: Pipeline to Summarize Clinical Question Notes
author: John Snow Labs
name: summarizer_clinical_questions_pipeline
date: 2023-05-31
tags: [licensed, en, clinical, summarization, question]
task: Summarization
language: en
edition: Healthcare NLP 4.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [summarizer_clinical_questions](https://nlp.johnsnowlabs.com/2023/04/03/summarizer_clinical_questions_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_questions_pipeline_en_4.4.1_3.0_1685530642775.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_questions_pipeline_en_4.4.1_3.0_1685530642775.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("summarizer_clinical_questions_pipeline", "en", "clinical/models")
text = """
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
"""
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("summarizer_clinical_questions_pipeline", "en", "clinical/models")
val text = """
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
"""
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
What are the treatments for hyperthyroidism?
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|summarizer_clinical_questions_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|936.7 MB|
## Included Models
- DocumentAssembler
- MedicalSummarizer
---
layout: model
title: Word2Vec Embeddings in Aragonese (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-14
tags: [cc, embeddings, fastText, word2vec, an, open_source]
task: Embeddings
language: an
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_an_3.4.1_3.0_1647282522053.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_an_3.4.1_3.0_1647282522053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","an") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","an")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
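Conceptually, this lookup annotator behaves like a dictionary from token to a fixed 300-dimensional vector, with unknown tokens resolving to a zero vector. A toy, self-contained sketch of that behaviour (3-dimensional vectors and an invented vocabulary, not the actual model data):

```python
# Toy lookup table standing in for the 300-dim Word2Vec vocabulary.
TOY_VECTORS = {
    "spark": [0.1, 0.2, 0.3],
    "nlp": [0.4, 0.5, 0.6],
}
DIM = 3

def lookup(token: str):
    """Return the stored vector, or a zero vector for out-of-vocabulary tokens."""
    return TOY_VECTORS.get(token.lower(), [0.0] * DIM)

print(lookup("Spark"))   # known token (case-insensitive here) → its stored vector
print(lookup("unseen"))  # out-of-vocabulary token → zero vector
```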
{:.nlu-block}
```python
import nlu
nlu.load("an.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|an|
|Size:|212.8 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Pipeline to Classify Texts into 4 News Categories
author: John Snow Labs
name: bert_sequence_classifier_age_news_pipeline
date: 2022-02-23
tags: [ag_news, news, bert, bert_sequence, classification, en, open_source]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 3.4.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of [bert_sequence_classifier_age_news_en](https://nlp.johnsnowlabs.com/2021/11/07/bert_sequence_classifier_age_news_en.html), which is imported from `HuggingFace`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_age_news_pipeline_en_3.4.0_3.0_1645616467835.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_age_news_pipeline_en_3.4.0_3.0_1645616467835.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
news_pipeline = PretrainedPipeline("bert_sequence_classifier_age_news_pipeline", lang = "en")
news_pipeline.annotate("Microsoft has taken its first step into the metaverse.")
```
```scala
val news_pipeline = new PretrainedPipeline("bert_sequence_classifier_age_news_pipeline", lang = "en")
news_pipeline.annotate("Microsoft has taken its first step into the metaverse.")
```
## Results
```bash
['Sci/Tech']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_age_news_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.4.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|42.4 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- BertForSequenceClassification
---
layout: model
title: English asr_wav2vec2_ksponspeech TFWav2Vec2ForCTC from Taeham
author: John Snow Labs
name: pipeline_asr_wav2vec2_ksponspeech
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_ksponspeech` is an English model originally trained by Taeham.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_ksponspeech_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_ksponspeech_en_4.2.0_3.0_1664102640740.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_ksponspeech_en_4.2.0_3.0_1664102640740.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_ksponspeech', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_ksponspeech", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_ksponspeech|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Italian T5ForConditionalGeneration Small Cased model (from efederici)
author: John Snow Labs
name: t5_it5_efficient_small_lfqa
date: 2023-01-30
tags: [it, open_source, t5, tensorflow]
task: Text Generation
language: it
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-efficient-small-lfqa` is an Italian model originally trained by `efederici`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_lfqa_it_4.3.0_3.0_1675103827826.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_lfqa_it_4.3.0_3.0_1675103827826.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_it5_efficient_small_lfqa","it") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_it5_efficient_small_lfqa","it")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_it5_efficient_small_lfqa|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|it|
|Size:|594.0 MB|
## References
- https://huggingface.co/efederici/it5-efficient-small-lfqa
---
layout: model
title: Legal NER (Parties, Dates, Document Type - sm)
author: John Snow Labs
name: legner_contract_doc_parties
date: 2022-08-16
tags: [en, legal, ner, agreements, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
IMPORTANT: Don't run this model on the whole legal agreement. Instead:
- Split the document into paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration;
- Use the `legclf_introduction_clause` Text Classifier to select only those paragraphs.
This is a Legal NER model, aimed at processing the first page of agreements, where information can be found about:
- Parties of the contract/agreement;
- Aliases of those parties, or how those parties will be referred to further on in the document;
- Document Type;
- Effective Date of the agreement.
This model can be used together with its Relation Extraction counterpart, `legre_contract_doc_parties`, to retrieve the relations between these entities.
Other models are available to detect other parts of the document, such as Headers/Subheaders, Signers, "Will-do" clauses, etc.
## Predicted Entities
`PARTY`, `EFFDATE`, `DOC`, `ALIAS`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/LEGALNER_PARTIES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_contract_doc_parties_en_1.0.0_3.2_1660647946284.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_contract_doc_parties_en_1.0.0_3.2_1660647946284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
ner_model = legal.NerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""
INTELLECTUAL PROPERTY AGREEMENT
This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties").
"""]
res = model.transform(spark.createDataFrame([text]).toDF("text"))
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tiny_squad2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tiny_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_tiny_squad2|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|307.2 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/deepset/tinyroberta-squad2
- https://haystack.deepset.ai/tutorials/first-qa-system
- https://arxiv.org/pdf/1909.10351.pdf
- https://github.com/deepset-ai/haystack
- https://github.com/deepset-ai/haystack/
- https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
- http://deepset.ai/
- https://haystack.deepset.ai/
- https://deepset.ai/german-bert
- https://deepset.ai/germanquad
- https://github.com/deepset-ai/haystack
- https://docs.haystack.deepset.ai
- https://haystack.deepset.ai/community/join
- https://twitter.com/deepset_ai
- https://www.linkedin.com/company/deepset-ai/
- https://haystack.deepset.ai/community
- https://github.com/deepset-ai/haystack/discussions
- https://deepset.ai
- http://www.deepset.ai/jobs
- https://paperswithcode.com/sota?task=Question+Answering&dataset=squad_v2
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_4_h_768
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-4_H-768` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_768_zh_4.2.4_3.0_1670325955053.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_768_zh_4.2.4_3.0_1670325955053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_768","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_768","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_4_h_768|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|170.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-4_H-768
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://github.com/dbiir/UER-py/
- https://cloud.tencent.com/
---
layout: model
title: English Biomedical ElectraForQuestionAnswering model
author: John Snow Labs
name: electra_qa_BioM_Base_SQuAD2_BioASQ8B
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
recommended: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Biomedical Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BioM-ELECTRA-Base-SQuAD2-BioASQ8B` is an English model originally trained by `sultan`.
This model is fine-tuned on the SQuAD2.0 dataset and then on the BioASQ8B-Factoid training dataset. The BioASQ8B-Factoid training dataset was converted to SQuAD1.1 format, and the model (BioM-ELECTRA-Base-SQuAD2) was trained and evaluated on it.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_BioM_Base_SQuAD2_BioASQ8B_en_4.0.0_3.0_1655918942331.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_BioM_Base_SQuAD2_BioASQ8B_en_4.0.0_3.0_1655918942331.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_BioM_Base_SQuAD2_BioASQ8B","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_BioM_Base_SQuAD2_BioASQ8B","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_bioasq8b.electra.base").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_BioM_Base_SQuAD2_BioASQ8B|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|403.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/sultan/BioM-ELECTRA-Base-SQuAD2-BioASQ8B
---
layout: model
title: Fast and Accurate Language Identification - 43 Languages (CNN)
author: John Snow Labs
name: ld_wiki_tatoeba_cnn_43
date: 2020-12-05
task: Language Detection
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [language_detection, open_source, xx]
supported: true
annotator: LanguageDetectorDL
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Language detection and identification is the task of automatically detecting the language(s) present in a document based on its content. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences, depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language in documents with mixed languages by coalescing sentences and selecting the best candidate.
We have designed and developed deep learning models using CNNs in TensorFlow/Keras. The models are trained on large datasets such as Wikipedia and Tatoeba, and achieve high accuracy evaluated on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias).
This model can detect the following languages:
`Arabic`, `Belarusian`, `Bulgarian`, `Czech`, `Danish`, `German`, `Greek`, `English`, `Esperanto`, `Spanish`, `Estonian`, `Persian`, `Finnish`, `French`, `Hebrew`, `Hindi`, `Hungarian`, `Interlingua`, `Indonesian`, `Icelandic`, `Italian`, `Japanese`, `Korean`, `Latin`, `Lithuanian`, `Latvian`, `Macedonian`, `Marathi`, `Dutch`, `Polish`, `Portuguese`, `Romanian`, `Russian`, `Slovak`, `Slovenian`, `Serbian`, `Swedish`, `Tagalog`, `Turkish`, `Tatar`, `Ukrainian`, `Vietnamese`, `Chinese`.
## Predicted Entities
`ar`, `be`, `bg`, `cs`, `da`, `de`, `el`, `en`, `eo`, `es`, `et`, `fa`, `fi`, `fr`, `he`, `hi`, `hu`, `ia`, `id`, `is`, `it`, `ja`, `ko`, `la`, `lt`, `lv`, `mk`, `mr`, `nl`, `pl`, `pt`, `ro`, `ru`, `sk`, `sl`, `sr`, `sv`, `tl`, `tr`, `tt`, `uk`, `vi`, `zh`.
{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ld_wiki_tatoeba_cnn_43_xx_2.7.0_2.4_1607184003726.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ld_wiki_tatoeba_cnn_43_xx_2.7.0_2.4_1607184003726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
language_detector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_43", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("language")
languagePipeline = Pipeline(stages=[documentAssembler, sentenceDetector, language_detector])
light_pipeline = LightPipeline(languagePipeline.fit(spark.createDataFrame([['']]).toDF("text")))
result = light_pipeline.fullAnnotate("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val languageDetector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_43", "xx")
.setInputCols("sentence")
.setOutputCol("language")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, languageDetector))
val data = Seq("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."]
lang_df = nlu.load('xx.classify.wiki_43').predict(text, output_level='sentence')
lang_df
```
## Results
```bash
'fr'
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ld_wiki_tatoeba_cnn_43|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[language]|
|Language:|xx|
## Data Source
Wikipedia and Tatoeba
## Benchmarking
```bash
Evaluated on the Europarl dataset, which the model has never seen:
+--------+-----+-------+------------------+
|src_lang|count|correct| precision|
+--------+-----+-------+------------------+
| fr| 1000| 1000| 1.0|
| nl| 1000| 999| 0.999|
| sv| 1000| 999| 0.999|
| pt| 1000| 999| 0.999|
| it| 1000| 999| 0.999|
| es| 1000| 999| 0.999|
| fi| 1000| 999| 0.999|
| el| 1000| 998| 0.998|
| de| 1000| 997| 0.997|
| da| 1000| 997| 0.997|
| en| 1000| 995| 0.995|
| lt| 1000| 986| 0.986|
| hu| 880| 867|0.9852272727272727|
| pl| 914| 899|0.9835886214442013|
| ro| 784| 765|0.9757653061224489|
| et| 928| 899| 0.96875|
| cs| 1000| 967| 0.967|
| sk| 1000| 966| 0.966|
| bg| 1000| 960| 0.96|
| sl| 914| 860|0.9409190371991247|
| lv| 916| 856|0.9344978165938864|
+--------+-----+-------+------------------+
+-------+--------------------+
|summary| precision|
+-------+--------------------+
| count| 21|
| mean| 0.9832737168612825|
| stddev|0.020064155103808722|
| min| 0.9344978165938864|
| max| 1.0|
+-------+--------------------+
```
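The precision column above is just `correct / count` per language, and the summary block is computed over those 21 values; a small sketch reproducing them from the table:

```python
from statistics import mean

# (count, correct) pairs copied from the benchmark table above
rows = {
    "fr": (1000, 1000), "nl": (1000, 999), "sv": (1000, 999), "pt": (1000, 999),
    "it": (1000, 999), "es": (1000, 999), "fi": (1000, 999), "el": (1000, 998),
    "de": (1000, 997), "da": (1000, 997), "en": (1000, 995), "lt": (1000, 986),
    "hu": (880, 867), "pl": (914, 899), "ro": (784, 765), "et": (928, 899),
    "cs": (1000, 967), "sk": (1000, 966), "bg": (1000, 960), "sl": (914, 860),
    "lv": (916, 856),
}

# Per-language precision is simply correct / count
precision = {lang: correct / count for lang, (count, correct) in rows.items()}

print(len(precision))            # 21 languages in the summary
print(mean(precision.values()))  # ~0.9833, the summary mean in the table
```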
---
layout: model
title: Legal Law Area Prediction Classifier (French)
author: John Snow Labs
name: legclf_law_area_prediction_french
date: 2023-03-29
tags: [fr, licensed, classification, legal, tensorflow]
task: Text Classification
language: fr
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a multiclass classification model that identifies the law area (`civil_law`, `penal_law`, `public_law`, `social_law`) of French court cases.
## Predicted Entities
`civil_law`, `penal_law`, `public_law`, `social_law`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_law_area_prediction_french_fr_1.0.0_3.0_1680094841099.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_law_area_prediction_french_fr_1.0.0_3.0_1680094841099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_multi_cased", "xx")\
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")
docClassifier = legal.ClassifierDLModel.pretrained("legclf_law_area_prediction_french", "fr", "legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
embeddings,
docClassifier
])
df = spark.createDataFrame([["par ces motifs, le Juge unique prononce : 1. Le recours est irrecevable. 2. Il n'est pas perçu de frais judiciaires. 3. Le présent arrêt est communiqué aux parties, au Tribunal administratif fédéral et à l'Office fédéral des assurances sociales. Lucerne, le 2 juin 2016 Au nom de la IIe Cour de droit social du Tribunal fédéral suisse Le Juge unique : Meyer Le Greffier : Cretton"]]).toDF("text")
model = nlpPipeline.fit(df)
result = model.transform(df)
result.select("text", "category.result").show(truncate=100)
```
## Results
```bash
+----------------------------------------------------------------------------------------------------+------------+
| text| result|
+----------------------------------------------------------------------------------------------------+------------+
|par ces motifs, le Juge unique prononce : 1. Le recours est irrecevable. 2. Il n'est pas perçu de...|[social_law]|
+----------------------------------------------------------------------------------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_law_area_prediction_french|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|fr|
|Size:|22.3 MB|
## References
Training dataset available [here](https://huggingface.co/datasets/rcds/legal_criticality_prediction)
## Benchmarking
```bash
label precision recall f1-score support
civil_law 0.93 0.91 0.92 613
penal_law 0.94 0.96 0.95 579
public_law 0.92 0.91 0.92 605
social_law 0.97 0.98 0.97 478
accuracy - - 0.94 2275
macro-avg 0.94 0.94 0.94 2275
weighted-avg 0.94 0.94 0.94 2275
```
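The `macro-avg` row is the unweighted mean of the per-label scores, while `weighted-avg` weights each label by its support; a quick sketch reproducing both from the rounded per-label F1 values above:

```python
# Per-label (f1, support) pairs from the benchmark table above
scores = {
    "civil_law":  (0.92, 613),
    "penal_law":  (0.95, 579),
    "public_law": (0.92, 605),
    "social_law": (0.97, 478),
}

f1s = [f1 for f1, _ in scores.values()]
supports = [s for _, s in scores.values()]

# Macro average: plain mean over labels, ignoring class sizes
macro_f1 = sum(f1s) / len(f1s)
# Weighted average: each label's score weighted by its support
weighted_f1 = sum(f1 * s for f1, s in scores.values()) / sum(supports)

print(round(macro_f1, 2))     # 0.94
print(round(weighted_f1, 2))  # 0.94
```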
---
layout: model
title: English BertForQuestionAnswering model (from hendrixcosta)
author: John Snow Labs
name: bert_qa_bertimbau_squad1.1
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bertimbau-squad1.1` is an English model originally trained by `hendrixcosta`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bertimbau_squad1.1_en_4.0.0_3.0_1654185392313.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bertimbau_squad1.1_en_4.0.0_3.0_1654185392313.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bertimbau_squad1.1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bertimbau_squad1.1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.by_hendrixcosta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bertimbau_squad1.1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.2 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/hendrixcosta/bertimbau-squad1.1
---
layout: model
title: Classification of Self-Reported Intimate Partner Violence (BioBERT)
author: John Snow Labs
name: bert_sequence_classifier_self_reported_partner_violence_tweet
date: 2022-07-28
tags: [sequence_classification, bert, classifier, clinical, en, licensed, public_health, partner_violence, tweet]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Classification of Self-Reported Intimate Partner Violence on Twitter. This model detects potential IPV victims on social media platforms (in English tweets).
## Predicted Entities
`intimate_partner_violence`, `non-intimate_partner_violence`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_PARTNER_VIOLENCE/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_self_reported_partner_violence_tweet_en_4.0.0_3.0_1658999356448.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_self_reported_partner_violence_tweet_en_4.0.0_3.0_1658999356448.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_self_reported_partner_violence_tweet", "en", "clinical/models")\
.setInputCols(["document",'token'])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
example = spark.createDataFrame(["I am fed up with this toxic relation.I hate my husband.",
"Can i say something real quick I ve never been one to publicly drag an ex partner and sometimes I regret that. I ve been reflecting on the harm, abuse and violence that was done to me and those bitches are truly lucky I chose peace amp therapy because they are trash forreal."], StringType()).toDF("text")
result = pipeline.fit(example).transform(example)
result.select("text", "class.result").show(truncate=False)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_self_reported_partner_violence_tweet", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))
// a couple of simple examples
val example = Seq("I am fed up with this toxic relation.I hate my husband.",
"Can i say something real quick I ve never been one to publicly drag an ex partner and sometimes I regret that. I ve been reflecting on the harm, abuse and violence that was done to me and those bitches are truly lucky I chose peace amp therapy because they are trash forreal.").toDS.toDF("text")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.self_reported_partner_violence").predict("""Can i say something real quick I ve never been one to publicly drag an ex partner and sometimes I regret that. I ve been reflecting on the harm, abuse and violence that was done to me and those bitches are truly lucky I chose peace amp therapy because they are trash forreal.""")
```
## Results
```bash
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+
|text |result |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+
|I am fed up with this toxic relation.I hate my husband. |[non-intimate_partner_violence]|
|Can i say something real quick I ve never been one to publicly drag an ex partner and sometimes I regret that. I ve been reflecting on the harm, abuse and violence that was done to me and those bitches are truly lucky I chose peace amp therapy because they are trash forreal.|[intimate_partner_violence] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_self_reported_partner_violence_tweet|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
[SMM4H 2022](https://healthlanguageprocessing.org/smm4h-2022/)
## Benchmarking
```bash
label precision recall f1-score support
intimate_partner_violence 0.96 0.97 0.97 630
non-intimate_partner_violence 0.75 0.69 0.72 78
accuracy - - 0.94 708
macro-avg 0.86 0.83 0.84 708
weighted-avg 0.94 0.94 0.94 708
```
---
layout: model
title: Clinical Deidentification (Spanish)
author: John Snow Labs
name: clinical_deidentification
date: 2022-02-17
tags: [deid, es, licensed,clinical]
task: De-identification
language: es
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline is trained with sciwiki_300d embeddings and can be used to deidentify PHI information from medical texts in Spanish. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask, fake or obfuscate the following entities: `AGE`, `DATE`, `PROFESSION`, `E-MAIL`, `USERNAME`, `LOCATION`, `DOCTOR`, `HOSPITAL`, `PATIENT`, `URL`, `IP`, `MEDICALRECORD`, `IDNUM`, `ORGANIZATION`, `PHONE`, `ZIP`, `ACCOUNT`, `SSN`, `PLATE`, `SEX` and `IPADDR`
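Conceptually, masking replaces each detected PHI span with its entity label, while obfuscation swaps it for a plausible surrogate value; a toy illustration of the difference (hypothetical spans and fake values, not the pipeline's actual implementation):

```python
# Toy illustration of mask vs. obfuscate (hypothetical spans, not the real pipeline)
text = "Nombre: Jose. Localidad: Madrid."
entities = [("Jose", "PATIENT"), ("Madrid", "LOCATION")]
fakes = {"PATIENT": "Carlos", "LOCATION": "Sevilla"}

masked = text
obfuscated = text
for span, label in entities:
    # Masking keeps only the entity type
    masked = masked.replace(span, f"<{label}>")
    # Obfuscation substitutes a realistic-looking fake value
    obfuscated = obfuscated.replace(span, fakes[label])

print(masked)      # Nombre: <PATIENT>. Localidad: <LOCATION>.
print(obfuscated)  # Nombre: Carlos. Localidad: Sevilla.
```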
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_es_3.4.1_3.0_1645118722536.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_es_3.4.1_3.0_1645118722536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from johnsnowlabs import *
deid_pipeline = PretrainedPipeline("clinical_deidentification", "es", "clinical/models")
sample = """Datos del paciente.
Nombre: Jose .
Apellidos: Aranda Martinez.
NHC: 2748903.
NASS: 26 37482910 04.
Domicilio: Calle Losada Martí 23, 5 B..
Localidad/ Provincia: Madrid.
CP: 28016.
Datos asistenciales.
Fecha de nacimiento: 15/04/1977.
País: España.
Edad: 37 años Sexo: F.
Fecha de Ingreso: 05/06/2018.
Médico: María Merino Viveros NºCol: 28 28 35489.
Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra. María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com
"""
result = deid_pipeline.annotate(sample)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "es", "clinical/models")
val sample = """Datos del paciente.
Nombre: Jose .
Apellidos: Aranda Martinez.
NHC: 2748903.
NASS: 26 37482910 04.
Domicilio: Calle Losada Martí 23, 5 B..
Localidad/ Provincia: Madrid.
CP: 28016.
Datos asistenciales.
Fecha de nacimiento: 15/04/1977.
País: España.
Edad: 37 años Sexo: F.
Fecha de Ingreso: 05/06/2018.
Médico: María Merino Viveros NºCol: 28 28 35489.
Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra. María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com
"
val result = deid_pipeline.annotate(sample)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.deid.clinical").predict("""Datos del paciente.
Nombre: Jose .
Apellidos: Aranda Martinez.
NHC: 2748903.
NASS: 26 37482910 04.
Domicilio: Calle Losada Martí 23, 5 B..
Localidad/ Provincia: Madrid.
CP: 28016.
Datos asistenciales.
Fecha de nacimiento: 15/04/1977.
País: España.
Edad: 37 años Sexo: F.
Fecha de Ingreso: 05/06/2018.
Médico: María Merino Viveros NºCol: 28 28 35489.
Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra. María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com
""")
```
## Results
```bash
Masked with entity labels
------------------------------
Datos del paciente.
Nombre: .
Apellidos: .
NHC: .
NASS: 04.
Domicilio: , 5 B..
Localidad/ Provincia: .
CP: .
Datos asistenciales.
Fecha de nacimiento: .
País: .
Edad: años Sexo: .
Fecha de Ingreso: .
: María Merino Viveros NºCol: .
Informe clínico del paciente: de años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias.
Antes de comenzar el cuadro estuvo en en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado.
Entre los comensales aparecieron varios casos de brucelosis.
Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación.
En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos.
Auscultación cardíaca rítmica, sin soplos, roces ni extratonos.
Auscultación pulmonar con conservación del murmullo vesicular.
Abdomen blando, depresible, sin masas ni megalias.
En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad.
Extremidades sin varices ni edemas.
Pulsos periféricos presentes y simétricos.
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3.
VSG: 40 mm 1ª hora.
Coagulación: TQ 87%;
TTPA 25,8 seg.
Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl.
Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: +++;
Test de Coombs > 1/1280; Brucellacapt > 1/5120.
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas).
El paciente significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra.
Servicio de Endocrinología y Nutrición km 12,500 28905 - () Correo electrónico:
Masked with chars
------------------------------
Datos del paciente.
Nombre: [**] .
Apellidos: [*************].
NHC: [*****].
NASS: ** [******] 04.
Domicilio: [*******************], 5 B..
Localidad/ Provincia: [****].
CP: [***].
Datos asistenciales.
Fecha de nacimiento: [********].
País: [****].
Edad: ** años Sexo: *.
Fecha de Ingreso: [********].
[****]: María Merino Viveros NºCol: ** ** [***].
Informe clínico del paciente: [***] de ** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias.
Antes de comenzar el cuadro estuvo en [*********] en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado.
Entre los comensales aparecieron varios casos de brucelosis.
Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación.
En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos.
Auscultación cardíaca rítmica, sin soplos, roces ni extratonos.
Auscultación pulmonar con conservación del murmullo vesicular.
Abdomen blando, depresible, sin masas ni megalias.
En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad.
Extremidades sin varices ni edemas.
Pulsos periféricos presentes y simétricos.
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3.
VSG: 40 mm 1ª hora.
Coagulación: TQ 87%;
TTPA 25,8 seg.
Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl.
Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: [*************] +++;
Test de Coombs > 1/1280; Brucellacapt > 1/5120.
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas).
El paciente [****] significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra.
[******************] [******************************] Servicio de Endocrinología y Nutrición [*****************] km 12,500 28905 [****] - [****] ([****]) Correo electrónico: [********************]
Masked with fixed length chars
------------------------------
Datos del paciente.
Nombre: **** .
Apellidos: ****.
NHC: ****.
NASS: **** **** 04.
Domicilio: ****, 5 B..
Localidad/ Provincia: ****.
CP: ****.
Datos asistenciales.
Fecha de nacimiento: ****.
País: ****.
Edad: **** años Sexo: ****.
Fecha de Ingreso: ****.
****: María Merino Viveros NºCol: **** **** ****.
Informe clínico del paciente: **** de **** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias.
Antes de comenzar el cuadro estuvo en **** en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado.
Entre los comensales aparecieron varios casos de brucelosis.
Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación.
En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos.
Auscultación cardíaca rítmica, sin soplos, roces ni extratonos.
Auscultación pulmonar con conservación del murmullo vesicular.
Abdomen blando, depresible, sin masas ni megalias.
En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad.
Extremidades sin varices ni edemas.
Pulsos periféricos presentes y simétricos.
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3.
VSG: 40 mm 1ª hora.
Coagulación: TQ 87%;
TTPA 25,8 seg.
Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl.
Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: **** +++;
Test de Coombs > 1/1280; Brucellacapt > 1/5120.
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas).
El paciente **** significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra.
**** **** Servicio de Endocrinología y Nutrición **** km 12,500 28905 **** - **** (****) Correo electrónico: ****
Obfuscated
------------------------------
Datos del paciente.
Nombre: Sr. Lerma .
Apellidos: Aristides Gonzalez Gelabert.
NHC: BBBBBBBBQR648597.
NASS: 041010000011 RZRM020101906017 04.
Domicilio: Valencia, 5 B..
Localidad/ Provincia: Madrid.
CP: 99335.
Datos asistenciales.
Fecha de nacimiento: 25/04/1977.
País: Barcelona.
Edad: 8 años Sexo: F..
Fecha de Ingreso: 02/08/2018.
transportista: María Merino Viveros NºCol: olegario10 olegario10 felisa78.
Informe clínico del paciente: RZRM020101906017 de 8 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias.
Antes de comenzar el cuadro estuvo en Madrid en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado.
Entre los comensales aparecieron varios casos de brucelosis.
Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación.
En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos.
Auscultación cardíaca rítmica, sin soplos, roces ni extratonos.
Auscultación pulmonar con conservación del murmullo vesicular.
Abdomen blando, depresible, sin masas ni megalias.
En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad.
Extremidades sin varices ni edemas.
Pulsos periféricos presentes y simétricos.
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3.
VSG: 40 mm 1ª hora.
Coagulación: TQ 87%;
TTPA 25,8 seg.
Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl.
Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Dra. Laguna +++;
Test de Coombs > 1/1280; Brucellacapt > 1/5120.
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas).
El paciente 041010000011 significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra.
Reinaldo Manjón Malo Barcelona Servicio de Endocrinología y Nutrición Valencia km 12,500 28905 Bilbao - Madrid (Barcelona) Correo electrónico: quintanasalome@example.net
```
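The four outputs above correspond to the pipeline's masking and obfuscation policies: replace each PHI chunk with its entity label, with same-length characters, with a fixed-length placeholder, or with a realistic fake value. As a rough illustration of what each policy does, here is a pure-Python sketch (not the Spark NLP implementation; the entity spans and replacement values are made up):

```python
# Toy illustration of the four de-identification policies shown above.
# Not the Spark NLP implementation; spans and fakes are hypothetical.

def mask(text, entities, mode="label", fake=None):
    """entities: list of (start, end, label); spans must not overlap."""
    out, prev = [], 0
    for start, end, label in sorted(entities):
        out.append(text[prev:start])
        chunk = text[start:end]
        if mode == "label":
            out.append(f"<{label}>")            # masked with entity labels
        elif mode == "same_length":
            out.append("[" + "*" * max(len(chunk) - 2, 1) + "]")  # masked with chars
        elif mode == "fixed_length":
            out.append("****")                  # masked with fixed length chars
        elif mode == "obfuscate":
            out.append(fake.get(label, chunk))  # replaced with a fake value
        prev = end
    out.append(text[prev:])
    return "".join(out)

text = "Nombre: Jose. Localidad: Madrid."
ents = [(8, 12, "PATIENT"), (25, 31, "CITY")]
print(mask(text, ents, "label"))         # Nombre: <PATIENT>. Localidad: <CITY>.
print(mask(text, ents, "fixed_length"))  # Nombre: ****. Localidad: ****.
print(mask(text, ents, "obfuscate", {"PATIENT": "Aurora", "CITY": "Bilbao"}))
```

The real pipeline additionally detects the entities itself (NER plus contextual parsers) and keeps obfuscated values format-consistent, e.g. shifting dates rather than substituting arbitrary strings.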
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|clinical_deidentification|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|es|
|Size:|281.3 MB|
## Included Models
- nlp.DocumentAssembler
- nlp.SentenceDetectorDLModel
- nlp.TokenizerModel
- nlp.WordEmbeddingsModel
- medical.NerModel
- nlp.NerConverter
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ChunkMergeModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- Finisher
---
layout: model
title: English BertForQuestionAnswering model (from gerardozq)
author: John Snow Labs
name: bert_qa_biobert_v1.1_pubmed_finetuned_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert_v1.1_pubmed-finetuned-squad` is an English model originally trained by `gerardozq`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_v1.1_pubmed_finetuned_squad_en_4.0.0_3.0_1654185735686.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_v1.1_pubmed_finetuned_squad_en_4.0.0_3.0_1654185735686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_v1.1_pubmed_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_biobert_v1.1_pubmed_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad_pubmed.biobert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
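Extractive question-answering models of this kind score every token as a potential answer start and end; the predicted answer is the highest-scoring valid span in the context. A minimal sketch of that selection step (toy scores, not real model outputs):

```python
# How an extractive QA head turns per-token start/end scores into an
# answer span: pick the (start, end) pair with the highest combined
# score, subject to start <= end and a maximum span length.
# The scores below are toy numbers, not real model logits.

def best_span(start_scores, end_scores, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score, best = s + end_scores[j], (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.1, 0.0, 0.1, 0.0, 0.0, 1.2, 0.0]
end   = [0.0, 0.1, 0.0, 4.8, 0.2, 0.0, 0.1, 0.0, 1.0, 0.3]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```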
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_biobert_v1.1_pubmed_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|403.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/gerardozq/biobert_v1.1_pubmed-finetuned-squad
---
layout: model
title: Hocr for table recognition
author: John Snow Labs
name: hocr_table_recognition
date: 2023-01-23
tags: [en, licensed]
task: HOCR Table Recognition
language: en
nav_key: models
edition: Visual NLP 4.2.4
spark_version: 3.2.1
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Table structure recognition based on hocr with Tesseract architecture.
Tesseract has been trained on a variety of datasets to improve its recognition capabilities. These datasets include images of text in various languages and scripts, as well as images with different font styles, sizes, and orientations. The training process involves feeding the engine with a large number of images and their corresponding text, allowing the engine to learn the patterns and characteristics of different text styles. One of the most important datasets used in training Tesseract is the UNLV dataset, which contains over 400,000 images of text in different languages, scripts, and font styles. This dataset is widely used in the OCR community and has been instrumental in improving the accuracy of Tesseract. Other datasets that have been used in training Tesseract include the ICDAR dataset, the IIIT-HWS dataset, and the RRC-GV-WS dataset.
In addition to these datasets, Tesseract also uses a technique called adaptive training, where the engine can continuously improve its recognition capabilities by learning from new images and text. This allows Tesseract to adapt to new text styles and languages, and improve its overall accuracy.
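hOCR is an HTML-based format in which each recognized word carries its bounding box in a `title` attribute; a table-recognition stage then has to group those word boxes into rows and columns. A simplified, self-contained sketch of that grouping (illustrative only, not the Spark OCR parser):

```python
import re

# hOCR encodes each word roughly as:
#   <span class='ocrx_word' title='bbox x1 y1 x2 y2'>word</span>
# This toy parser extracts (text, box) pairs and clusters words into
# table rows by vertical position, then orders each row by x-coordinate.
# Illustrative only -- the markup and tolerance are simplified.

HOCR = """
<span class='ocrx_word' title='bbox 10 10 50 30'>Name</span>
<span class='ocrx_word' title='bbox 120 12 180 30'>Age</span>
<span class='ocrx_word' title='bbox 10 50 60 70'>Clara</span>
<span class='ocrx_word' title='bbox 120 52 150 70'>33</span>
"""

WORD = re.compile(r"title='bbox (\d+) (\d+) (\d+) (\d+)'>([^<]+)</span>")

def parse_rows(hocr, row_tol=15):
    words = [(int(y1), int(x1), text)
             for x1, y1, x2, y2, text in WORD.findall(hocr)]
    rows = []
    for y, x, text in sorted(words):
        # Same row if the word's top edge is close to the row's first word.
        if rows and abs(rows[-1][0][0] - y) <= row_tol:
            rows[-1].append((y, x, text))
        else:
            rows.append([(y, x, text)])
    return [[t for _, _, t in sorted(r, key=lambda w: w[1])] for r in rows]

print(parse_rows(HOCR))  # [['Name', 'Age'], ['Clara', '33']]
```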
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/ocr/IMAGE_TABLE_RECOGNITION_HOCR/){:.button.button-orange.button-orange-trans.co.button-icon}
[Open in Colab](https://github.com/JohnSnowLabs/spark-ocr-workshop/tree/master/jupyter/SparkOcrImageTableRecognitionWHOCR.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
binary_to_image = BinaryToImage() \
.setInputCol("content") \
.setOutputCol("image")
table_detector = ImageTableDetector.pretrained("general_model_table_detection_v2", "en", "clinical/ocr") \
.setInputCol("image") \
.setOutputCol("table_regions")
splitter = ImageSplitRegions() \
.setInputCol("image") \
.setInputRegionsCol("table_regions") \
.setOutputCol("table_image") \
.setDropCols("image") \
.setImageType(ImageType.TYPE_BYTE_GRAY) \
.setExplodeCols([])
text_detector = ImageTextDetectorV2.pretrained("image_text_detector_v2", "en", "clinical/ocr") \
.setInputCol("image") \
.setOutputCol("text_regions") \
.setWithRefiner(True)
draw_regions = ImageDrawRegions() \
.setInputCol("image") \
.setInputRegionsCol("text_regions") \
.setOutputCol("image_with_regions") \
.setRectColor(Color.green) \
.setRotated(True)
img_to_hocr = ImageToTextV2.pretrained("ocr_small_printed", "en", "clinical/ocr") \
.setInputCols(["image", "text_regions"]) \
.setUsePandasUdf(False) \
.setOutputFormat(OcrOutputFormat.HOCR) \
.setOutputCol("hocr") \
.setGroupImages(False)
hocr_to_table = HocrToTextTable() \
.setInputCol("hocr") \
.setRegionCol("table_regions") \
.setOutputCol("tables")
pipeline = PipelineModel(stages=[
binary_to_image,
table_detector,
splitter,
text_detector,
draw_regions,
img_to_hocr,
hocr_to_table
])
imagePath = "data/tab_images_hocr_1/table4_1.jpg"
image_df = spark.read.format("binaryFile").load(imagePath)
result = pipeline.transform(image_df).cache()
```
```scala
val binary_to_image = new BinaryToImage()
.setInputCol("content")
.setOutputCol("image")
val table_detector = ImageTableDetector
.pretrained("general_model_table_detection_v2", "en", "clinical/ocr")
.setInputCol("image")
.setOutputCol("table_regions")
val splitter = new ImageSplitRegions()
.setInputCol("image")
.setInputRegionsCol("table_regions")
.setOutputCol("table_image")
.setDropCols("image")
.setImageType(ImageType.TYPE_BYTE_GRAY)
.setExplodeCols(Array())
val text_detector = ImageTextDetectorV2
.pretrained("image_text_detector_v2", "en", "clinical/ocr")
.setInputCol("image")
.setOutputCol("text_regions")
.setWithRefiner(true)
val draw_regions = new ImageDrawRegions()
.setInputCol("image")
.setInputRegionsCol("text_regions")
.setOutputCol("image_with_regions")
.setRectColor(Color.green)
.setRotated(true)
val img_to_hocr = ImageToTextV2
.pretrained("ocr_small_printed", "en", "clinical/ocr")
.setInputCols(Array("image", "text_regions"))
.setUsePandasUdf(false)
.setOutputFormat(OcrOutputFormat.HOCR)
.setOutputCol("hocr")
.setGroupImages(False)
val hocr_to_table = new HocrToTextTable()
.setInputCol("hocr")
.setRegionCol("table_regions")
.setOutputCol("tables")
val pipeline = new Pipeline().setStages(Array(
binary_to_image,
table_detector,
splitter,
text_detector,
draw_regions,
img_to_hocr,
hocr_to_table))
val imagePath = "data/tab_images_hocr_1/table4_1.jpg"
val image_df = spark.read.format("binaryFile").load(imagePath)
val result = pipeline.fit(image_df).transform(image_df).cache()
```
## Example
{%- capture input_image -%}

{%- endcapture -%}
{%- capture output_image -%}

{%- endcapture -%}
{% include templates/input_output_image.md
input_image=input_image
output_image=output_image
%}
## Output text
```bash
text_regions table_image pagenum modificationTime path table_regions length image image_with_regions hocr tables exception table_index
[{0, 0, 566.32025... {file:/content/ta... 0 2023-01-23 08:21:... file:/content/tab... {0, 0, 40.0, 0.0,... 172124 {file:/content/ta... {file:/content/ta...
```
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_12_en_4.1.0_3.0_1660171578567.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_12_en_4.1.0_3.0_1660171578567.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_rust_image_classification_12", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_rust_image_classification_12", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_rust_image_classification_12|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Mongolian RobertaForTokenClassification Base Cased model (from onon214)
author: John Snow Labs
name: roberta_token_classifier_base_ner_demo
date: 2023-03-01
tags: [mn, open_source, roberta, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: mn
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-ner-demo` is a Mongolian model originally trained by `onon214`.
## Predicted Entities
`MISC`, `LOC`, `PER`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_base_ner_demo_mn_4.3.0_3.0_1677703536380.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_base_ner_demo_mn_4.3.0_3.0_1677703536380.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_base_ner_demo","mn") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_base_ner_demo","mn")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_token_classifier_base_ner_demo|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|mn|
|Size:|466.3 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/onon214/roberta-base-ner-demo
---
layout: model
title: Fast Neural Machine Translation Model from Igbo to English
author: John Snow Labs
name: opus_mt_ig_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, ig, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `ig`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ig_en_xx_2.7.0_2.4_1609163903644.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ig_en_xx_2.7.0_2.4_1609163903644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_ig_en", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_ig_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.ig.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_ig_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Right of setoff Clause Binary Classifier
author: John Snow Labs
name: legclf_right_of_setoff_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `right-of-setoff` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip it unless you want to do Binary Classification at the sentence level.
If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `right-of-setoff`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_right_of_setoff_clause_en_1.0.0_3.2_1660122968803.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_right_of_setoff_clause_en_1.0.0_3.2_1660122968803.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
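This card's usage snippet is missing. The sketch below follows the usual Legal NLP binary clause classifier pattern (document assembling, sentence embeddings feeding the `sentence_embeddings` input column listed in Model Information, then the classifier writing to `category`). It is a sketch under assumptions, not the official snippet: the embeddings model `sent_bert_base_cased` and the input column name `text` are assumptions, so verify them against the embeddings this classifier was trained with.

```python
# Hedged sketch: requires a licensed Spark NLP for Legal installation.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
# Assumption: the classifier consumes BERT sentence embeddings via "sentence_embeddings".
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")
docClassifier = legal.ClassifierDLModel.pretrained("legclf_right_of_setoff_clause", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("category")
pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])
df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```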
## Results
```bash
+-----------------+
|           result|
+-----------------+
|[right-of-setoff]|
|          [other]|
|          [other]|
|[right-of-setoff]|
+-----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_right_of_setoff_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.95 0.99 0.97 94
right-of-setoff 0.94 0.77 0.85 22
accuracy - - 0.95 116
macro-avg 0.95 0.88 0.91 116
weighted-avg 0.95 0.95 0.95 116
```
---
layout: model
title: Legal Erisa Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_erisa_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, erisa, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Erisa` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip it unless you want to do Binary Classification at the sentence level.
If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Erisa`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_erisa_bert_en_1.0.0_3.0_1678050577180.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_erisa_bert_en_1.0.0_3.0_1678050577180.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
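This card's usage snippet is missing. The sketch below follows the usual Legal NLP binary clause classifier pattern (document assembling, sentence embeddings feeding the `sentence_embeddings` input column listed in Model Information, then the classifier writing to `class`). It is a sketch under assumptions, not the official snippet: the embeddings model `sent_bert_base_cased` and the input column name `text` are assumptions, so verify them against the embeddings this classifier was trained with.

```python
# Hedged sketch: requires a licensed Spark NLP for Legal installation.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
# Assumption: the classifier consumes BERT sentence embeddings via "sentence_embeddings".
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")
docClassifier = legal.ClassifierDLModel.pretrained("legclf_erisa_bert", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])
df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```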
## Results
```bash
+-------+
| result|
+-------+
|[Erisa]|
|[Other]|
|[Other]|
|[Erisa]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_erisa_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Erisa 0.97 1.00 0.99 35
Other 1.00 0.98 0.99 56
accuracy - - 0.99 91
macro-avg 0.99 0.99 0.99 91
weighted-avg 0.99 0.99 0.99 91
```
---
layout: model
title: Turkish ElectraForQuestionAnswering model (from enelpi) Discriminator Version-2
author: John Snow Labs
name: electra_qa_base_discriminator_finetuned_squadv2
date: 2022-06-22
tags: [tr, open_source, electra, question_answering]
task: Question Answering
language: tr
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-discriminator-finetuned_squadv2_tr` is a Turkish model originally trained by `enelpi`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_discriminator_finetuned_squadv2_tr_4.0.0_3.0_1655920605376.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_discriminator_finetuned_squadv2_tr_4.0.0_3.0_1655920605376.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_discriminator_finetuned_squadv2","tr") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_discriminator_finetuned_squadv2","tr")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("tr.answer_question.squadv2.electra.base_v2").predict("""Benim adım ne?|||Benim adım Clara ve Berkeley'de yaşıyorum.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_base_discriminator_finetuned_squadv2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|tr|
|Size:|412.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/enelpi/electra-base-discriminator-finetuned_squadv2_tr
---
layout: model
title: English RoBERTa Embeddings (from abhi1nandy2)
author: John Snow Labs
name: roberta_embeddings_Bible_roberta_base
date: 2022-04-14
tags: [roberta, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `Bible-roberta-base` is an English model originally trained by `abhi1nandy2`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_Bible_roberta_base_en_3.4.2_3.0_1649947380949.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_Bible_roberta_base_en_3.4.2_3.0_1649947380949.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_Bible_roberta_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_Bible_roberta_base","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.Bible_roberta_base").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_Bible_roberta_base|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|468.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/abhi1nandy2/Bible-roberta-base
- https://www.kaggle.com/oswinrh/bible
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_ft_news
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta_FT_newsqa` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ft_news_en_4.3.0_3.0_1674222981909.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ft_news_en_4.3.0_3.0_1674222981909.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ft_news","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ft_news","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_ft_news|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|458.6 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AnonymousSub/roberta_FT_newsqa
---
layout: model
title: English asr_wav2vec2_cetuc_sid_voxforge_mls_0 TFWav2Vec2ForCTC from joaoalvarenga
author: John Snow Labs
name: pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_0
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_cetuc_sid_voxforge_mls_0` is an English model originally trained by joaoalvarenga.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_0_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_0_en_4.2.0_3.0_1664022807196.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_0_en_4.2.0_3.0_1664022807196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_0', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_0", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_0|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Part of Speech for Chinese
author: John Snow Labs
name: pos_ctb9
date: 2021-01-03
task: Part of Speech Tagging
language: zh
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [pos, zh, cn, open_source]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model annotates the part of speech of tokens in a text. The parts of speech annotated include PN (pronoun), CC (coordinating conjunction), and 39 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
## Predicted Entities
`AD`, `AS`, `BA`, `CC`, `CD`, `CS`, `DEC`, `DEG`, `DER`, `DEV`, `DT`, `EM`, `ETC`, `FW`, `IC`, `IJ`, `JJ`, `LB`, `LC`, `M`, `MSP`, `MSP-2`, `NN`, `NN-SHORT`, `NOI`, `NR`, `NR-SHORT`, `NT`, `NT-SHORT`, `OD`, `ON`, `P`, `PN`, `PU`, `SB`, `SP`, `URL`, `VA`, `VC`, `VE`, and `VV`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ctb9_zh_2.7.0_2.4_1609696404134.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ctb9_zh_2.7.0_2.4_1609696404134.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh")\
.setInputCols(["sentence"])\
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ctb9", "zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[document_assembler, sentence_detector, word_segmenter, pos])
example = spark.createDataFrame([['然而,这样的处理也衍生了一些问题。']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh")
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ctb9", "zh")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, pos))
val data = Seq("然而,这样的处理也衍生了一些问题。").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""然而,这样的处理也衍生了一些问题。"""]
pos_df = nlu.load('zh.pos.ctb9').predict(text, output_level='token')
pos_df
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_chexpert_pipeline", "en", "clinical/models")
pipeline.annotate("FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax . FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base.")
```
```scala
val pipeline = new PretrainedPipeline("ner_chexpert_pipeline", "en", "clinical/models")
pipeline.annotate("FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax . FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.chexpert.pipeline").predict("""FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax . FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base.""")
```
## Results
```bash
| | chunk | label |
|---:|:-------------------------|:--------|
| 0 | endotracheal tube | OBS |
| 1 | Swan - Ganz catheter | OBS |
| 2 | left chest | ANAT |
| 3 | tube | OBS |
| 4 | in place | OBS |
| 5 | pneumothorax | OBS |
| 6 | Mild atelectatic changes | OBS |
| 7 | left base | ANAT |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_chexpert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: Spanish RobertaForQuestionAnswering Large Cased model (from BSC-TeMU)
author: John Snow Labs
name: roberta_qa_bsc_temu_large_bne_s_c
date: 2022-12-02
tags: [es, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: es
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-bne-sqac` is a Spanish model originally trained by `BSC-TeMU`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_bsc_temu_large_bne_s_c_es_4.2.4_3.0_1669987001377.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_bsc_temu_large_bne_s_c_es_4.2.4_3.0_1669987001377.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_bsc_temu_large_bne_s_c","es")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_bsc_temu_large_bne_s_c","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_bsc_temu_large_bne_s_c|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/BSC-TeMU/roberta-large-bne-sqac
- https://arxiv.org/abs/1907.11692
- http://www.bne.es/en/Inicio/index.html
- https://github.com/PlanTL-SANIDAD/lm-spanish
- https://arxiv.org/abs/2107.07253
---
layout: model
title: Bangla BertForQuestionAnswering model (from sagorsarker)
author: John Snow Labs
name: bert_qa_mbert_bengali_tydiqa_qa
date: 2022-06-02
tags: [bn, open_source, question_answering, bert]
task: Question Answering
language: bn
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mbert-bengali-tydiqa-qa` is a Bangla model originally trained by `sagorsarker`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_bengali_tydiqa_qa_bn_4.0.0_3.0_1654188244734.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_bengali_tydiqa_qa_bn_4.0.0_3.0_1654188244734.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_bengali_tydiqa_qa","bn") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_mbert_bengali_tydiqa_qa","bn")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("bn.answer_question.tydiqa.multi_lingual_bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
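In the NLU one-liner above, the question and context are passed as a single string separated by `|||`. A minimal pure-Python sketch of assembling and splitting that format (`qa_input` is a hypothetical helper for illustration, not part of the NLU API):

```python
def qa_input(question: str, context: str) -> str:
    # NLU question-answering loaders take "question|||context" in one string.
    return f"{question}|||{context}"

s = qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
print(s)  # What's my name?|||My name is Clara and I live in Berkeley.

# Recover the two parts again:
question, context = s.split("|||")
```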
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_mbert_bengali_tydiqa_qa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|bn|
|Size:|626.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/sagorsarker/mbert-bengali-tydiqa-qa
- https://github.com/sagorbrur
- https://github.com/sagorbrur/bntransformer
- https://github.com/google-research-datasets/tydiqa
- https://www.linkedin.com/in/sagor-sarker/
- https://www.kaggle.com/
---
layout: model
title: Malay (macrolanguage) BertForQuestionAnswering model (from zhufy)
author: John Snow Labs
name: bert_qa_squad_ms_bert_base
date: 2022-06-02
tags: [open_source, question_answering, bert]
task: Question Answering
language: ms
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad-ms-bert-base` is a Malay (macrolanguage) model originally trained by `zhufy`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_squad_ms_bert_base_ms_4.0.0_3.0_1654192110158.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_squad_ms_bert_base_ms_4.0.0_3.0_1654192110158.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad_ms_bert_base","ms") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_squad_ms_bert_base","ms")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("ms.answer_question.squad.bert.ms_tuned.base.by_zhufy").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_squad_ms_bert_base|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ms|
|Size:|412.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/zhufy/squad-ms-bert-base
- https://github.com/huseinzol05/malay-dataset/tree/master/question-answer/squad
---
layout: model
title: Russian Part of Speech Tagger (from KoichiYasuoka)
author: John Snow Labs
name: bert_pos_bert_base_russian_upos
date: 2022-05-09
tags: [bert, pos, part_of_speech, ru, open_source]
task: Part of Speech Tagging
language: ru
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-russian-upos` is a Russian model originally trained by `KoichiYasuoka`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_russian_upos_ru_3.4.2_3.0_1652091813748.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_russian_upos_ru_3.4.2_3.0_1652091813748.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_russian_upos","ru") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Я люблю Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_russian_upos","ru")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Я люблю Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_pos_bert_base_russian_upos|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|ru|
|Size:|665.1 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/KoichiYasuoka/bert-base-russian-upos
- https://universaldependencies.org/ru/
- https://universaldependencies.org/u/pos/
- https://github.com/KoichiYasuoka/esupar
---
layout: model
title: Word Embeddings for Hindi (hindi_cc_300d)
author: John Snow Labs
name: hindi_cc_300d
date: 2021-02-03
task: Embeddings
language: hi
edition: Spark NLP 2.7.2
spark_version: 2.4
tags: [embeddings, open_source, hi]
supported: true
annotator: WordEmbeddingsModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained on Common Crawl and Wikipedia using fastText. It is trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives.
The model gives 300-dimensional vector outputs per token. The output vectors map words into a meaningful space where the distance between vectors reflects the semantic similarity of the words.
These embeddings can be used in multiple tasks like semantic word similarity, named entity recognition, sentiment analysis, and classification.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/hindi_cc_300d_hi_2.7.2_2.4_1612362695785.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/hindi_cc_300d_hi_2.7.2_2.4_1612362695785.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of a pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
embeddings = WordEmbeddingsModel.pretrained("hindi_cc_300d", "hi") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
```
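Downstream of Spark NLP, the similarity property described above can be checked with cosine similarity between token vectors. A minimal sketch with toy 3-dimensional vectors standing in for this model's 300-dimensional fastText embeddings:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors:
    # 1.0 = identical direction, 0.0 = orthogonal (unrelated).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for 300-dimensional fastText embeddings.
king = [0.9, 0.1, 0.3]
queen = [0.85, 0.15, 0.35]
apple = [0.1, 0.9, 0.2]

print(cosine_similarity(king, queen))  # close to 1: related words
print(cosine_similarity(king, apple))  # smaller: unrelated words
```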
{:.nlu-block}
```python
import nlu
nlu.load("hi.embed").predict("""Put your text here.""")
```
## Results
```bash
The model gives a 300-dimensional feature vector output per token.
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|hindi_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.7.2+|
|License:|Open Source|
|Input Labels:|[document, token]|
|Output Labels:|[word_embeddings]|
|Language:|hi|
|Case sensitive:|false|
|Dimension:|300|
## Data Source
This model is imported from https://fasttext.cc/docs/en/crawl-vectors.html
---
layout: model
title: English asr_wav2vec2_thai_ASR TFWav2Vec2ForCTC from Rattana
author: John Snow Labs
name: pipeline_asr_wav2vec2_thai_ASR
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_thai_ASR` is an English model originally trained by Rattana.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_thai_ASR_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_thai_ASR_en_4.2.0_3.0_1664112707199.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_thai_ASR_en_4.2.0_3.0_1664112707199.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_thai_ASR', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_thai_ASR", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_thai_ASR|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten TFWav2Vec2ForCTC from patrickvonplaten
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten` is an English model originally trained by patrickvonplaten.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten_en_4.2.0_3.0_1664114081296.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten_en_4.2.0_3.0_1664114081296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|349.3 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English asr_temp TFWav2Vec2ForCTC from ying-tina
author: John Snow Labs
name: pipeline_asr_temp
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_temp` is an English model originally trained by ying-tina.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_temp_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_temp_en_4.2.0_3.0_1664110837777.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_temp_en_4.2.0_3.0_1664110837777.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_temp', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_temp", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_temp|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English ElectraForQuestionAnswering model (from sultan)
author: John Snow Labs
name: electra_qa_BioM_Base_SQuAD2
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BioM-ELECTRA-Base-SQuAD2` is an English model originally trained by `sultan`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_BioM_Base_SQuAD2_en_4.0.0_3.0_1655918898262.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_BioM_Base_SQuAD2_en_4.0.0_3.0_1655918898262.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_BioM_Base_SQuAD2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_BioM_Base_SQuAD2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.electra.base.by_sultan").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_BioM_Base_SQuAD2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|403.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/sultan/BioM-ELECTRA-Base-SQuAD2
- https://github.com/salrowili/BioM-Transformers
---
layout: model
title: NER Pipeline for Clinical Problems (reduced taxonomy) - Voice of the Patient
author: John Snow Labs
name: ner_vop_problem_reduced_pipeline
date: 2023-06-10
tags: [licensed, pipeline, ner, en, vop, problem]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline extracts mentions of clinical problems from health-related text in colloquial language. All problem entities are merged into one generic Problem class.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_reduced_pipeline_en_4.4.3_3.0_1686420051472.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_reduced_pipeline_en_4.4.3_3.0_1686420051472.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_vop_problem_reduced_pipeline", "en", "clinical/models")
pipeline.annotate("""I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms.""")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_vop_problem_reduced_pipeline", "en", "clinical/models")
val result = pipeline.annotate("""I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms.""")
```
## Results
```bash
| chunk | ner_label |
|:---------------------|:------------|
| pain | Problem |
| fatigue | Problem |
| rheumatoid arthritis | Problem |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_problem_reduced_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|791.6 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: NER Model for 10 High Resourced Languages
author: John Snow Labs
name: xlm_roberta_large_token_classifier_hrl
date: 2021-12-26
tags: [arabic, german, english, spanish, french, italian, latvian, dutch, portuguese, chinese, xlm, roberta, ner, xx, open_source]
task: Named Entity Recognition
language: xx
edition: Spark NLP 3.3.4
spark_version: 2.4
supported: true
recommended: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model was imported from `Hugging Face` and it's been fine-tuned for 10 high resourced languages (Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese and Chinese), leveraging `XLM-RoBERTa` embeddings and `XlmRobertaForTokenClassification` for NER purposes.
## Predicted Entities
`ORG`, `PER`, `LOC`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_HRL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_HRL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_hrl_xx_3.3.4_2.4_1640520352673.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_hrl_xx_3.3.4_2.4_1640520352673.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_large_token_classifier_hrl", "xx")\
.setInputCols(["sentence",'token'])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = """يمكنكم مشاهدة أمير منطقة الرياض الأمير فيصل بن بندر بن عبد العزيز في كل مناسبة وافتتاح تتعلق بمشاريع التعليم والصحة وخدمة الطرق والمشاريع الثقافية في منطقة الرياض."""
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_large_token_classifier_hrl", "xx")
.setInputCols(Array("sentence","token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))
val example = Seq("يمكنكم مشاهدة أمير منطقة الرياض الأمير فيصل بن بندر بن عبد العزيز في كل مناسبة وافتتاح تتعلق بمشاريع التعليم والصحة وخدمة الطرق والمشاريع الثقافية في منطقة الرياض.").toDF("text")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("xx.ner.high_resourced_lang").predict("""يمكنكم مشاهدة أمير منطقة الرياض الأمير فيصل بن بندر بن عبد العزيز في كل مناسبة وافتتاح تتعلق بمشاريع التعليم والصحة وخدمة الطرق والمشاريع الثقافية في منطقة الرياض.""")
```
## Results
```bash
+---------------------------+---------+
|chunk |ner_label|
+---------------------------+---------+
|الرياض |LOC |
|فيصل بن بندر بن عبد العزيز |PER |
|الرياض |LOC |
+---------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_large_token_classifier_hrl|
|Compatibility:|Spark NLP 3.3.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|xx|
|Size:|1.8 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## Data Source
[https://huggingface.co/Davlan/xlm-roberta-large-ner-hrl](https://huggingface.co/Davlan/xlm-roberta-large-ner-hrl)
---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC4CHEMD_Chem_Original_SciBERT_384
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Original-SciBERT-384` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities
`Chemical`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_SciBERT_384_en_4.0.0_3.0_1657108742652.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_SciBERT_384_en_4.0.0_3.0_1657108742652.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_SciBERT_384","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_SciBERT_384","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_BC4CHEMD_Chem_Original_SciBERT_384|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|410.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Original-SciBERT-384
---
layout: model
title: Legal Agreement and Plan of Reorganization Document Classifier (Longformer)
author: John Snow Labs
name: legclf_agreement_and_plan_of_reorganization
date: 2022-12-06
tags: [en, legal, classification, agreement, reorganization, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_agreement_and_plan_of_reorganization` model is a Legal Longformer Document Classifier that predicts whether a document belongs to the class `agreement-and-plan-of-reorganization` or not (binary classification).
Longformers are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. In our experience, for the large majority of documents in legal corpora, 4096 tokens are enough for document classification, provided the documents are clean and contain only the legal text without extra material before it.
If that is not the case, let us know and we can take another approach for you: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, so that the whole document is taken into account. In theory, however, this should not be required.
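The chunk-and-average approach described above can be sketched as follows (a minimal numpy illustration under assumed shapes, not the production implementation):

```python
import numpy as np

def document_embedding(token_embeddings: np.ndarray, chunk_size: int = 4096) -> np.ndarray:
    """Average the per-chunk mean embeddings, so every token contributes."""
    n_tokens, _ = token_embeddings.shape
    chunk_means = [
        token_embeddings[start:start + chunk_size].mean(axis=0)
        for start in range(0, n_tokens, chunk_size)
    ]
    return np.mean(chunk_means, axis=0)

# 10,000 tokens split into three chunks (4096 + 4096 + 1808), averaged into one vector
emb = document_embedding(np.random.rand(10_000, 768))
print(emb.shape)  # (768,)
```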
## Predicted Entities
`agreement-and-plan-of-reorganization`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_agreement_and_plan_of_reorganization_en_1.0.0_3.0_1670357495950.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_agreement_and_plan_of_reorganization_en_1.0.0_3.0_1670357495950.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------+
|result|
+-------+
|[agreement-and-plan-of-reorganization]|
|[other]|
|[other]|
|[agreement-and-plan-of-reorganization]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_agreement_and_plan_of_reorganization|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.2 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house + SEC documents
## Benchmarking
```bash
label precision recall f1-score support
agreement-and-plan-of-reorganization 1.00 0.92 0.96 52
other 0.97 1.00 0.98 111
accuracy - - 0.98 163
macro-avg 0.98 0.96 0.97 163
weighted-avg 0.98 0.98 0.98 163
```
---
layout: model
title: English Electra Embeddings (from google)
author: John Snow Labs
name: electra_embeddings_electra_large_generator
date: 2022-05-17
tags: [en, open_source, electra, embeddings]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-large-generator` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_large_generator_en_3.4.4_3.0_1652786652489.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_large_generator_en_3.4.4_3.0_1652786652489.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_large_generator","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_large_generator","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_embeddings_electra_large_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|192.5 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/google/electra-large-generator
- https://arxiv.org/pdf/1406.2661.pdf
- https://rajpurkar.github.io/SQuAD-explorer/
- https://openreview.net/pdf?id=r1xMH1BtvB
- https://gluebenchmark.com/
- https://www.clips.uantwerpen.be/conll2000/chunking/
---
layout: model
title: English DistilBertForQuestionAnswering model (from aszidon) Custom Version-4
author: John Snow Labs
name: distilbert_qa_custom4
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbertcustom4` is an English model originally trained by `aszidon`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom4_en_4.0.0_3.0_1654728016252.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom4_en_4.0.0_3.0_1654728016252.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom4","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom4","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.custom4.by_aszidon").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
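In the nlu call above, `|||` separates the question from the context inside a single string. A minimal pure-Python illustration of that convention (the `split_qa` helper is hypothetical, not part of nlu):

```python
def split_qa(payload: str, sep: str = "|||") -> tuple:
    """Split an nlu-style 'question|||context' payload into its two parts."""
    question, context = payload.split(sep, 1)
    return question.strip(), context.strip()

q, c = split_qa("What is my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What is my name?
print(c)  # My name is Clara and I live in Berkeley.
```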
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_custom4|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/aszidon/distilbertcustom4
---
layout: model
title: Entity Recognizer LG
author: John Snow Labs
name: entity_recognizer_lg
date: 2022-06-25
tags: [ru, open_source]
task: Named Entity Recognition
language: ru
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The entity_recognizer_lg is a pretrained pipeline that performs basic text processing steps and recognizes entities.
It covers most of the common text processing tasks on your dataframe.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_ru_4.0.0_3.0_1656125353536.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_ru_4.0.0_3.0_1656125353536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("entity_recognizer_lg", "ru")
result = pipeline.annotate("""I love johnsnowlabs! """)
```
{:.nlu-block}
```python
import nlu
nlu.load("ru.ner.lg").predict("""I love johnsnowlabs! """)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|entity_recognizer_lg|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|ru|
|Size:|2.5 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- NerDLModel
- NerConverter
---
layout: model
title: Legal Competition Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_competition_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, competition, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
The legclf_competition_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Competition or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`Competition`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_competition_bert_en_1.0.0_3.0_1678111605884.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_competition_bert_en_1.0.0_3.0_1678111605884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------+
|result|
+-------+
|[Competition]|
|[Other]|
|[Other]|
|[Competition]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_competition_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.7 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Competition 0.93 0.87 0.90 363
Other 0.87 0.92 0.90 333
accuracy - - 0.90 696
macro-avg 0.90 0.90 0.90 696
weighted-avg 0.90 0.90 0.90 696
```
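The macro-avg and weighted-avg rows follow from the per-label scores and supports in the usual way; a small pure-Python sketch reproducing them (values copied from the table above):

```python
def macro_avg(scores):
    """Unweighted mean over labels."""
    return sum(scores) / len(scores)

def weighted_avg(scores, supports):
    """Support-weighted mean over labels."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)

f1 = [0.9, 0.9]          # Competition, Other
support = [363, 333]
print(round(macro_avg(f1), 2))              # 0.9
print(round(weighted_avg(f1, support), 2))  # 0.9
```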
---
layout: model
title: Pipeline to Detect Adverse Drug Events (MedicalBertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_ner_ade_binary_pipeline
date: 2023-03-20
tags: [clinical, ade, licensed, public_health, token_classification, ner, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [bert_token_classifier_ner_ade_binary](https://nlp.johnsnowlabs.com/2022/07/27/bert_token_classifier_ner_ade_binary_en_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_binary_pipeline_en_4.3.0_3.2_1679299868936.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_binary_pipeline_en_4.3.0_3.2_1679299868936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_ner_ade_binary_pipeline", "en", "clinical/models")
text = '''I used to be on paxil but that made me more depressed and prozac made me angry, Maybe cos of the insulin blocking effect of seroquel but i do feel sugar crashes when eat fast carbs.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_ner_ade_binary_pipeline", "en", "clinical/models")
val text = "I used to be on paxil but that made me more depressed and prozac made me angry, Maybe cos of the insulin blocking effect of seroquel but i do feel sugar crashes when eat fast carbs."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:--------------|--------:|------:|:------------|-------------:|
| 0 | depressed | 44 | 52 | ADE | 0.990846 |
| 1 | angry | 73 | 77 | ADE | 0.972025 |
| 2 | sugar crashes | 147 | 159 | ADE | 0.933623 |
```
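The begin/end columns above are inclusive character offsets into the input text, so each chunk can be recovered with `text[begin:end + 1]`; a quick pure-Python check using the example text from this card:

```python
text = ("I used to be on paxil but that made me more depressed and prozac "
        "made me angry, Maybe cos of the insulin blocking effect of seroquel "
        "but i do feel sugar crashes when eat fast carbs.")

def chunk_at(text: str, begin: int, end: int) -> str:
    """Spark NLP annotation offsets are inclusive, hence end + 1."""
    return text[begin:end + 1]

print(chunk_at(text, 44, 52))    # depressed
print(chunk_at(text, 73, 77))    # angry
print(chunk_at(text, 147, 159))  # sugar crashes
```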
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_ade_binary_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.7 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverterInternalModel
---
layout: model
title: Portuguese asr_bp_tedx100_xlsr TFWav2Vec2ForCTC from lgris
author: John Snow Labs
name: asr_bp_tedx100_xlsr
date: 2022-09-26
tags: [wav2vec2, pt, audio, open_source, asr]
task: Automatic Speech Recognition
language: pt
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_bp_tedx100_xlsr` is a Portuguese model originally trained by lgris.
NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_bp_tedx100_xlsr_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_bp_tedx100_xlsr_pt_4.2.0_3.0_1664192496842.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_bp_tedx100_xlsr_pt_4.2.0_3.0_1664192496842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_bp_tedx100_xlsr", "pt")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_bp_tedx100_xlsr", "pt")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_bp_tedx100_xlsr|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|pt|
|Size:|756.0 MB|
---
layout: model
title: Detect Pathogen, Medical Condition and Medicine
author: John Snow Labs
name: ner_pathogen
date: 2022-06-28
tags: [licensed, clinical, en, ner, pathogen, medical_condition, medicine]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained named entity recognition (NER) model is a deep learning model for detecting medical conditions (influenza, headache, malaria, etc.), medicines (aspirin, penicillin, methotrexate) and pathogens (Corona Virus, Zika Virus, E. Coli, etc.) in clinical texts. It was trained with the `MedicalNerApproach` annotator, which allows training generic neural-network-based NER models.
## Predicted Entities
`Pathogen`, `MedicalCondition`, `Medicine`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_pathogen_en_4.0.0_3.0_1656419618392.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_pathogen_en_4.0.0_3.0_1656419618392.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetectorDL = SentenceDetectorDLModel\
.pretrained("sentence_detector_dl_healthcare", "en", 'clinical/models') \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical" ,"en", "clinical/models")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_pathogen", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
ner_model,
ner_converter])
data = spark.createDataFrame([["""Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. This can progress to loss of skin color, a fast heart rate as it becomes more severe. While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical" ,"en", "clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_pathogen", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_model,
ner_converter))
val data = Seq("""Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. This can progress to loss of skin color, a fast heart rate as it becomes more severe. While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.pathogen").predict("""Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. This can progress to loss of skin color, a fast heart rate as it becomes more severe. While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.""")
```
## Results
```bash
+---------------+----------------+
|chunk |ner_label |
+---------------+----------------+
|Racecadotril |Medicine |
|loperamide |Medicine |
|Diarrhea |MedicalCondition|
|dehydration |MedicalCondition|
|skin color |MedicalCondition|
|fast heart rate|MedicalCondition|
|rabies virus |Pathogen |
|Lyssavirus |Pathogen |
|Ephemerovirus |Pathogen |
+---------------+----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_pathogen|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|14.6 MB|
## References
Trained on [dataset](https://www.kaggle.com/datasets/finalepoch/medical-ner) to get a model for Named Entity Recognition.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Pathogen 15.0 3.0 9.0 24.0 0.8333 0.625 0.7143
Medicine 15.0 2.0 0.0 15.0 0.8824 1.0 0.9375
MedicalCondition 53.0 2.0 6.0 59.0 0.9636 0.8983 0.9298
macro - - - - - - 0.8605
micro - - - - - - 0.8782
```
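The precision, recall and f1 columns above are derived from the tp/fp/fn counts in the standard way; a short pure-Python sketch reproducing two of the rows:

```python
def prf(tp: float, fp: float, fn: float) -> tuple:
    """Standard precision/recall/F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 4), round(recall, 4), round(f1, 4)

print(prf(15, 3, 9))  # Pathogen row: (0.8333, 0.625, 0.7143)
print(prf(53, 2, 6))  # MedicalCondition row: (0.9636, 0.8983, 0.9298)
```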
---
layout: model
title: Russian BertForQuestionAnswering Cased model (from ruselkomp)
author: John Snow Labs
name: bert_qa_deep_pavlov_full
date: 2022-07-07
tags: [ru, open_source, bert, question_answering]
task: Question Answering
language: ru
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deep-pavlov-full` is a Russian model originally trained by `ruselkomp`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_deep_pavlov_full_ru_4.0.0_3.0_1657189256159.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_deep_pavlov_full_ru_4.0.0_3.0_1657189256159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_deep_pavlov_full","ru") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["Как меня зовут?", "Меня зовут Клара, и я живу в Беркли."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_deep_pavlov_full","ru")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("Как меня зовут?", "Меня зовут Клара, и я живу в Беркли.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_deep_pavlov_full|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ru|
|Size:|665.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ruselkomp/deep-pavlov-full
---
layout: model
title: Pipeline to Detect Living Species
author: John Snow Labs
name: ner_living_species_biobert_pipeline
date: 2023-03-20
tags: [ner, en, clinical, licensed, biobert]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_living_species_biobert](https://nlp.johnsnowlabs.com/2022/06/22/ner_living_species_biobert_en_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_biobert_pipeline_en_4.3.0_3.2_1679309343209.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_biobert_pipeline_en_4.3.0_3.2_1679309343209.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_living_species_biobert_pipeline", "en", "clinical/models")
text = '''42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_living_species_biobert_pipeline", "en", "clinical/models")
val text = "42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:------------------------|--------:|------:|:------------|-------------:|
| 0 | woman | 12 | 16 | HUMAN | 0.9999 |
| 1 | bacterial | 145 | 153 | SPECIES | 0.9981 |
| 2 | Fusarium spp | 337 | 348 | SPECIES | 0.9873 |
| 3 | patient | 355 | 361 | HUMAN | 0.9991 |
| 4 | species | 507 | 513 | SPECIES | 0.9926 |
| 5 | Fusarium solani complex | 522 | 544 | SPECIES | 0.8422 |
| 6 | antifungals | 792 | 802 | SPECIES | 0.9929 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_living_species_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.3 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from iis2009002)
author: John Snow Labs
name: xlmroberta_ner_iis2009002_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `iis2009002`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_iis2009002_base_finetuned_panx_de_4.1.0_3.0_1660433851832.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_iis2009002_base_finetuned_panx_de_4.1.0_3.0_1660433851832.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_iis2009002_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_iis2009002_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_iis2009002_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/iis2009002/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: English BertForQuestionAnswering Cased model (from yossra)
author: John Snow Labs
name: bert_qa_yossra_finetuned_squad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `yossra`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_yossra_finetuned_squad_en_4.0.0_3.0_1657186848644.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_yossra_finetuned_squad_en_4.0.0_3.0_1657186848644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_yossra_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_yossra_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_yossra_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/yossra/bert-finetuned-squad
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab40 TFWav2Vec2ForCTC from hassnain
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab40
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab40` is an English model originally trained by hassnain.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab40_gpu.
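As the annotator name suggests, Wav2Vec2ForCTC decodes frame-level predictions with Connectionist Temporal Classification. As a purely illustrative sketch (not Spark NLP's internal decoder), greedy CTC decoding collapses repeated frame labels and then removes the blank symbol:

```python
def ctc_greedy_collapse(frame_labels, blank="-"):
    """Collapse runs of repeated frame labels, then drop blanks (greedy CTC)."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev:          # keep only the first label of each run
            if lab != blank:     # blanks separate genuinely repeated characters
                out.append(lab)
            prev = lab
    return "".join(out)

# A frame sequence like "hh-ee-l-ll-oo" decodes to "hello"
print(ctc_greedy_collapse(list("hh-ee-l-ll-oo")))
```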
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab40_en_4.2.0_3.0_1664020850617.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab40_en_4.2.0_3.0_1664020850617.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_timit_demo_colab40", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_timit_demo_colab40", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab40|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|
---
layout: model
title: Finnish asr_wav2vec2_xlsr_train_aug_bigLM_1B TFWav2Vec2ForCTC from RASMUS
author: John Snow Labs
name: pipeline_asr_wav2vec2_xlsr_train_aug_bigLM_1B
date: 2022-09-25
tags: [wav2vec2, fi, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_train_aug_bigLM_1B` is a Finnish model originally trained by RASMUS.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xlsr_train_aug_bigLM_1B_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_train_aug_bigLM_1B_fi_4.2.0_3.0_1664097620804.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_train_aug_bigLM_1B_fi_4.2.0_3.0_1664097620804.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_train_aug_bigLM_1B', lang = 'fi')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_train_aug_bigLM_1B", lang = "fi")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xlsr_train_aug_bigLM_1B|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|fi|
|Size:|3.6 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Legal Parties Clause Binary Classifier
author: John Snow Labs
name: legclf_parties_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `parties` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences rather than the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the hundreds of other Legal Clauses Classifiers you will find in Models Hub, producing a series of True/False values for each of the legal clause models you have added.
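The paragraph-level splitting described above can be approximated outside Spark NLP as well. The sketch below is illustrative only: whitespace tokens stand in for the embedding model's real subword tokenizer, so the counts against the 512-token budget are a rough approximation.

```python
def split_paragraphs(text, max_tokens=512):
    """Split a document on blank lines and report which pieces fit the budget.

    Whitespace tokenization is only a stand-in for the model's actual
    subword tokenizer; treat the token counts as approximate.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = "First clause text.\n\nSecond clause text goes here."
for para, fits in split_paragraphs(doc):
    print(fits, para)
```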
## Predicted Entities
`other`, `parties`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_parties_clause_en_1.0.0_3.2_1660123811373.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_parties_clause_en_1.0.0_3.2_1660123811373.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+---------+
|   result|
+---------+
|[parties]|
|  [other]|
|  [other]|
|[parties]|
+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_parties_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.95 0.97 0.96 91
parties 0.90 0.84 0.87 32
accuracy - - 0.93 123
macro-avg 0.92 0.91 0.91 123
weighted-avg 0.93 0.93 0.93 123
```
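The macro and weighted averages in reports like the one above are plain arithmetic over the per-label rows. A small sketch with made-up scores (not the exact inputs behind this table, which are computed from unrounded per-class values):

```python
def macro_and_weighted(scores, supports):
    """Macro: unweighted mean of per-class scores.
    Weighted: mean of per-class scores weighted by class support."""
    macro = sum(scores) / len(scores)
    total = sum(supports)
    weighted = sum(s * n for s, n in zip(scores, supports)) / total
    return macro, weighted

# Hypothetical per-class F1 scores with supports of 10 and 30 examples
macro, weighted = macro_and_weighted([0.8, 0.6], [10, 30])
print(round(macro, 2), round(weighted, 2))
```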
---
layout: model
title: English LongformerForQuestionAnswering model (from Nomi97)
author: John Snow Labs
name: longformer_qa_Chatbot
date: 2022-06-26
tags: [en, open_source, longformer, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: LongformerForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Chatbot_QA` is an English model originally trained by `Nomi97`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_qa_Chatbot_en_4.0.0_3.0_1656255131812.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_qa_Chatbot_en_4.0.0_3.0_1656255131812.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_Chatbot","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_Chatbot","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.longformer.by_Nomi97").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|longformer_qa_Chatbot|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|546.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Nomi97/Chatbot_QA
---
layout: model
title: Part of Speech for Arabic
author: John Snow Labs
name: pos_ud_padt
date: 2021-03-09
tags: [part_of_speech, open_source, arabic, pos_ud_padt, ar]
task: Part of Speech Tagging
language: ar
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`.
## Predicted Entities
- X
- VERB
- NOUN
- ADJ
- ADP
- PUNCT
- NUM
- None
- PRON
- SCONJ
- CCONJ
- DET
- PART
- ADV
- SYM
- AUX
- PROPN
- INTJ
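The averaged perceptron architecture mentioned above scores each candidate tag as a sum of feature weights learned during training and keeps a running average of those weights for stability. The toy sketch below (illustrative only, with hypothetical features, not Spark NLP's actual implementation) shows the prediction step:

```python
def predict_tag(features, weights):
    """Score each tag by summing its weights over the active features;
    return the best-scoring tag (ties broken alphabetically)."""
    scores = {}
    for feat in features:
        for tag, w in weights.get(feat, {}).items():
            scores[tag] = scores.get(tag, 0.0) + w
    return max(sorted(scores), key=lambda t: scores[t]) if scores else None

# Hypothetical weights: a suffix feature and a previous-tag feature vote per tag
weights = {
    "suffix=ات": {"NOUN": 2.0, "ADJ": 0.5},
    "prev=ADP":  {"NOUN": 1.0, "VERB": -1.0},
}
print(predict_tag(["suffix=ات", "prev=ADP"], weights))  # NOUN
```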
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_padt_ar_3.0.0_3.0_1615292251530.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_padt_ar_3.0.0_3.0_1615292251530.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
pos_tagger = PerceptronModel.pretrained("pos_ud_padt", "ar") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos_tagger
])
example = spark.createDataFrame([['مرحبا من جون سنو مختبرات! ']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val posTagger = PerceptronModel.pretrained("pos_ud_padt", "ar")
.setInputCols("sentence", "token")
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, posTagger))
val data = Seq("مرحبا من جون سنو مختبرات! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["مرحبا من جون سنو مختبرات! "]
token_df = nlu.load('ar.pos').predict(text)
token_df
```
## Results
```bash
token pos
0 مرحبا NOUN
1 من ADP
2 جون X
3 سنو X
4 مختبرات NOUN
5 ! PUNCT
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_padt|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|ar|
---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_bert_small_finetuned_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-small-finetuned-squad` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_finetuned_squad_en_4.0.0_3.0_1654184762850.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_finetuned_squad_en_4.0.0_3.0_1654184762850.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_small_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_small_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.small.by_anas-awadalla").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_small_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|107.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-small-finetuned-squad
---
layout: model
title: Translate Artificial languages to English Pipeline
author: John Snow Labs
name: translate_art_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, art, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `art`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_art_en_xx_2.7.0_2.4_1609686264952.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_art_en_xx_2.7.0_2.4_1609686264952.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_art_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_art_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.art.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_art_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from mrm8488)
author: John Snow Labs
name: t5_base_finetuned_squadv2
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-finetuned-squadv2` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_squadv2_en_4.3.0_3.0_1675109106503.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_squadv2_en_4.3.0_3.0_1675109106503.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_base_finetuned_squadv2","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_base_finetuned_squadv2","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_base_finetuned_squadv2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|861.2 MB|
## References
- https://huggingface.co/mrm8488/t5-base-finetuned-squadv2
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://rajpurkar.github.io/SQuAD-explorer/
- https://arxiv.org/pdf/1910.10683.pdf
- https://i.imgur.com/jVFMMWR.png
- https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb
- https://twitter.com/mrm8488
- https://www.linkedin.com/in/manuel-romero-cs/
---
layout: model
title: Spanish BertForMaskedLM Base Cased model (from dccuchile)
author: John Snow Labs
name: bert_embeddings_base_spanish_wwm_cased
date: 2022-12-02
tags: [es, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: es
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-cased` is a Spanish model originally trained by `dccuchile`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_spanish_wwm_cased_es_4.2.4_3.0_1670018860888.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_spanish_wwm_cased_es_4.2.4_3.0_1670018860888.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_spanish_wwm_cased","es") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_spanish_wwm_cased","es")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_spanish_wwm_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|es|
|Size:|412.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased
- https://github.com/google-research/bert
- https://github.com/josecannete/spanish-corpora
- https://github.com/google-research/bert/blob/master/multilingual.md
- https://users.dcc.uchile.cl/~jperez/beto/uncased_2M/tensorflow_weights.tar.gz
- https://users.dcc.uchile.cl/~jperez/beto/uncased_2M/pytorch_weights.tar.gz
- https://users.dcc.uchile.cl/~jperez/beto/cased_2M/tensorflow_weights.tar.gz
- https://users.dcc.uchile.cl/~jperez/beto/cased_2M/pytorch_weights.tar.gz
- https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1827
- https://www.kaggle.com/nltkdata/conll-corpora
- https://github.com/gchaperon/beto-benchmarks/blob/master/conll2002/dev_results_beto-cased_conll2002.txt
- https://github.com/facebookresearch/MLDoc
- https://github.com/gchaperon/beto-benchmarks/blob/master/MLDoc/dev_results_beto-cased_mldoc.txt
- https://github.com/gchaperon/beto-benchmarks/blob/master/MLDoc/dev_results_beto-uncased_mldoc.txt
- https://github.com/google-research-datasets/paws/tree/master/pawsx
- https://github.com/facebookresearch/XNLI
- https://colab.research.google.com/drive/1uRwg4UmPgYIqGYY4gW_Nsw9782GFJbPt
- https://www.adere.so/
- https://imfd.cl/en/
- https://www.tensorflow.org/tfrc
- https://users.dcc.uchile.cl/~jperez/papers/pml4dc2020.pdf
- https://github.com/google-research/bert/blob/master/multilingual.md
- https://arxiv.org/pdf/1904.09077.pdf
- https://arxiv.org/pdf/1906.01502.pdf
- https://arxiv.org/abs/1812.10464
- https://arxiv.org/pdf/1901.07291.pdf
- https://arxiv.org/pdf/1904.02099.pdf
- https://arxiv.org/pdf/1906.01569.pdf
- https://arxiv.org/abs/1908.11828
---
layout: model
title: Detect Normalized Genes and Human Phenotypes
author: John Snow Labs
name: ner_human_phenotype_go_clinical
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model can be used to detect normalized mentions of genes (go) and human phenotypes (hp) in medical text.
## Predicted Entities
`GO`, `HP`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_HUMAN_PHENOTYPE_GO_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_go_clinical_en_3.0.0_3.0_1617209694955.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_go_clinical_en_3.0.0_3.0_1617209694955.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_human_phenotype_go_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("entities")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate("Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.")
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_human_phenotype_go_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("entities")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))
val data = Seq("Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.human_phenotype.go_clinical").predict("""Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.""")
```
## Results
```bash
+----+--------------------------+---------+-------+----------+
| | chunk | begin | end | entity |
+====+==========================+=========+=======+==========+
| 0 | tumor | 39 | 43 | HP |
+----+--------------------------+---------+-------+----------+
| 1 | tricarboxylic acid cycle | 79 | 102 | GO |
+----+--------------------------+---------+-------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_human_phenotype_go_clinical|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Benchmarking
```bash
| | label | tp | fp | fn | prec | rec | f1 |
|---:|--------------:|------:|-----:|-----:|---------:|---------:|---------:|
| 0 | B-GO | 1530 | 129 | 57 | 0.922242 | 0.964083 | 0.942699 |
| 1 | B-HP | 950 | 133 | 130 | 0.877193 | 0.87963 | 0.87841 |
| 2 | I-HP | 253 | 46 | 68 | 0.846154 | 0.788162 | 0.816129 |
| 3 | I-GO | 4550 | 344 | 154 | 0.92971 | 0.967262 | 0.948114 |
| 4 | Macro-average | 7283 | 652 | 409 | 0.893825 | 0.899784 | 0.896795 |
| 5 | Micro-average | 7283 | 652 | 409 | 0.917832 | 0.946828 | 0.932105 |
```
---
layout: model
title: English asr_wav2vec_large_xlsr_korean TFWav2Vec2ForCTC from fleek
author: John Snow Labs
name: pipeline_asr_wav2vec_large_xlsr_korean
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec_large_xlsr_korean` is an English model originally trained by fleek.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec_large_xlsr_korean_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec_large_xlsr_korean_en_4.2.0_3.0_1664098611361.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec_large_xlsr_korean_en_4.2.0_3.0_1664098611361.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec_large_xlsr_korean', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec_large_xlsr_korean", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec_large_xlsr_korean|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Fast Neural Machine Translation Model from Turkic Languages to English
author: John Snow Labs
name: opus_mt_trk_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, trk, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `trk`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_trk_en_xx_2.7.0_2.4_1609167597007.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_trk_en_xx_2.7.0_2.4_1609167597007.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_trk_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_trk_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.trk.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_trk_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from shaojie)
author: John Snow Labs
name: distilbert_qa_shaojie_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `shaojie`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_shaojie_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772539534.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_shaojie_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772539534.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shaojie_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shaojie_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_shaojie_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/shaojie/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from lorenzkuhn)
author: John Snow Labs
name: roberta_qa_lorenzkuhn_base_finetuned_squad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad` is an English model originally trained by `lorenzkuhn`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_lorenzkuhn_base_finetuned_squad_en_4.3.0_3.0_1674217419784.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_lorenzkuhn_base_finetuned_squad_en_4.3.0_3.0_1674217419784.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_lorenzkuhn_base_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_lorenzkuhn_base_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_lorenzkuhn_base_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|464.2 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/lorenzkuhn/roberta-base-finetuned-squad
---
layout: model
title: Summarize Radiology Reports
author: John Snow Labs
name: summarizer_radiology
date: 2023-04-23
tags: [clinical, licensed, en, summarization, tensorflow, radiology]
task: Summarization
language: en
edition: Healthcare NLP 4.4.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalSummarizer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is capable of summarizing radiology reports while preserving the important information such as imaging tests and findings.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_radiology_en_4.4.0_3.0_1682218525772.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_radiology_en_4.4.0_3.0_1682218525772.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
summarizer = MedicalSummarizer.pretrained("summarizer_radiology", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("summary")\
.setMaxTextLength(512)\
.setMaxNewTokens(512)
pipeline = Pipeline(stages=[
document,
summarizer
])
text = """INDICATIONS: Peripheral vascular disease with claudication.
RIGHT:
1. Normal arterial imaging of right lower extremity.
2. Peak systolic velocity is normal.
3. Arterial waveform is triphasic.
4. Ankle brachial index is 0.96.
LEFT:
1. Normal arterial imaging of left lower extremity.
2. Peak systolic velocity is normal.
3. Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic.
4. Ankle brachial index is 1.06.
IMPRESSION:
Normal arterial imaging of both lower lobes.
"""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val summarizer = MedicalSummarizer.pretrained("summarizer_radiology", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("summary")
.setMaxTextLength(512)
.setMaxNewTokens(512)
val pipeline = new Pipeline().setStages(Array(document_assembler, summarizer))
val text = """INDICATIONS: Peripheral vascular disease with claudication.
RIGHT:
1. Normal arterial imaging of right lower extremity.
2. Peak systolic velocity is normal.
3. Arterial waveform is triphasic.
4. Ankle brachial index is 0.96.
LEFT:
1. Normal arterial imaging of left lower extremity.
2. Peak systolic velocity is normal.
3. Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic.
4. Ankle brachial index is 1.06.
IMPRESSION:
Normal arterial imaging of both lower lobes.
"""
val data = Seq(text).toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
The patient has peripheral vascular disease with claudication. The right lower extremity shows normal arterial imaging, but the peak systolic velocity is normal. The arterial waveform is triphasic throughout, except for the posterior tibial artery, which is biphasic. The ankle brachial index is 0.96. The impression is normal arterial imaging of both lower lobes.
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|summarizer_radiology|
|Compatibility:|Healthcare NLP 4.4.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|920.4 MB|
---
layout: model
title: English BertForQuestionAnswering model (from AnonymousSub)
author: John Snow Labs
name: bert_qa_news_pretrain_bert_FT_new_newsqa
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `news_pretrain_bert_FT_new_newsqa` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_news_pretrain_bert_FT_new_newsqa_en_4.0.0_3.0_1654188909680.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_news_pretrain_bert_FT_new_newsqa_en_4.0.0_3.0_1654188909680.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_news_pretrain_bert_FT_new_newsqa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_news_pretrain_bert_FT_new_newsqa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.news.bert.new.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_news_pretrain_bert_FT_new_newsqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|407.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/news_pretrain_bert_FT_new_newsqa
---
layout: model
title: Modern Greek (1453-) asr_greek_lsr_1 TFWav2Vec2ForCTC from skylord
author: John Snow Labs
name: asr_greek_lsr_1
date: 2022-09-25
tags: [wav2vec2, el, audio, open_source, asr]
task: Automatic Speech Recognition
language: el
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_greek_lsr_1` is a Modern Greek (1453-) model originally trained by skylord.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_greek_lsr_1_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_greek_lsr_1_el_4.2.0_3.0_1664110738611.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_greek_lsr_1_el_4.2.0_3.0_1664110738611.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_greek_lsr_1", "el")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_greek_lsr_1", "el")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_greek_lsr_1|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|el|
|Size:|1.2 GB|
---
layout: model
title: English BertForQuestionAnswering model (from healx)
author: John Snow Labs
name: bert_qa_biomedical_slot_filling_reader_base
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biomedical-slot-filling-reader-base` is an English model originally trained by `healx`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biomedical_slot_filling_reader_base_en_4.0.0_3.0_1654185786674.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biomedical_slot_filling_reader_base_en_4.0.0_3.0_1654185786674.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biomedical_slot_filling_reader_base","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_biomedical_slot_filling_reader_base","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bio_medical.bert.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_biomedical_slot_filling_reader_base|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/healx/biomedical-slot-filling-reader-base
- https://arxiv.org/abs/2109.08564
---
layout: model
title: Legal Dissolution Clause Binary Classifier
author: John Snow Labs
name: legclf_dissolution_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `dissolution` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only individual sentences rather than the whole text, so it's better to skip them unless you specifically want to do binary classification at sentence level.
If you have big legal documents, and you want to look for clauses, we recommend you to split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
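The first splitting option above, paragraph splitting by multiline, can be sketched in a few lines of plain Python outside Spark NLP; the regex, helper name, and sample clause text below are illustrative, not taken from the tutorial:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on blank lines (multiline splitting),
    keeping each piece small enough for the 512-token embedding limit."""
    # One or more blank lines (possibly containing spaces) separate paragraphs.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = """DISSOLUTION. The partnership may be dissolved by majority vote.

GOVERNING LAW. This agreement is governed by the laws of Delaware."""

paragraphs = split_paragraphs(doc)
print(len(paragraphs))
```

Each resulting paragraph can then be fed to the classifier as an independent row of the input DataFrame.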
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `dissolution`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_dissolution_clause_en_1.0.0_3.2_1660122374357.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_dissolution_clause_en_1.0.0_3.2_1660122374357.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
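This card ships without a usage snippet. The sketch below follows the pattern of sibling `legclf_*` clause-classifier cards (document assembly, Universal Sentence Encoder embeddings feeding a ClassifierDL model); the embeddings model name `tfhub_use`, the example clause, and the exact stage wiring are assumptions inferred from the card's input/output labels, so verify them against the Models Hub entry:

```python
# Assumed pipeline; verify stage and model names against the Models Hub entry.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

classifier = ClassifierDLModel.pretrained("legclf_dissolution_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, classifier])

data = spark.createDataFrame(
    [["The Company shall be dissolved upon the affirmative vote of a majority of its Members."]]
).toDF("text")

result = pipeline.fit(data).transform(data)
```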
## Results
```bash
+-------------+
|       result|
+-------------+
|[dissolution]|
|      [other]|
|      [other]|
|[dissolution]|
+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_dissolution_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet and classified in-house.
## Benchmarking
```bash
label precision recall f1-score support
dissolution 0.95 0.90 0.92 39
other 0.97 0.99 0.98 137
accuracy - - 0.97 176
macro-avg 0.96 0.94 0.95 176
weighted-avg 0.97 0.97 0.97 176
```
---
layout: model
title: Clinical Portuguese Bert Embeddings (Biomedical)
author: John Snow Labs
name: biobert_embeddings_biomedical
date: 2022-04-11
tags: [biobert, embeddings, pt, open_source]
task: Embeddings
language: pt
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BioBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `biobertpt-bio` is a Portuguese model originally trained by `pucpr`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_embeddings_biomedical_pt_3.4.2_3.0_1649687586887.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_embeddings_biomedical_pt_3.4.2_3.0_1649687586887.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("biobert_embeddings_biomedical","pt") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Odeio o cancro"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("biobert_embeddings_biomedical","pt")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Odeio o cancro").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("pt.embed.gs_biomedical").predict("""Odeio o cancro""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|biobert_embeddings_biomedical|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|pt|
|Size:|667.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/pucpr/biobertpt-bio
- https://aclanthology.org/2020.clinicalnlp-1.7/
- https://github.com/HAILab-PUCPR/BioBERTpt
---
layout: model
title: Adverse Drug Events Classifier (BERT)
author: John Snow Labs
name: bert_sequence_classifier_ade
date: 2022-02-08
tags: [bert, sequence_classification, en, licensed]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: MedicalBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Classifies texts/sentences into two categories:
- `True` : The sentence is talking about a possible ADE.
- `False` : The sentence doesn’t have any information about an ADE.
This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier.
## Predicted Entities
`True`, `False`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_ADE/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/08.3.MedicalBertForSequenceClassification_in_SparkNLP.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_ade_en_3.4.1_3.0_1644324436716.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_ade_en_3.4.1_3.0_1644324436716.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
data = spark.createDataFrame([["I felt a bit drowsy and had blurred vision after taking Aspirin."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier))
val data = Seq("I felt a bit drowsy and had blurred vision after taking Aspirin.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.ade.seq_biobert").predict("""I felt a bit drowsy and had blurred vision after taking Aspirin.""")
```
## Results
```bash
+----------------------------------------------------------------+------+
|text |result|
+----------------------------------------------------------------+------+
|I felt a bit drowsy and had blurred vision after taking Aspirin.|[True]|
+----------------------------------------------------------------+------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_ade|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.0 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
This model is trained on a custom dataset comprising the CADEC, DRUG-AE, and Twimed corpora.
## Benchmarking
```bash
label precision recall f1-score support
False 0.97 0.97 0.97 6884
True 0.87 0.85 0.86 1398
accuracy 0.95 0.95 0.95 8282
macro-avg 0.92 0.91 0.91 8282
weighted-avg 0.95 0.95 0.95 8282
```
---
layout: model
title: English RobertaForQuestionAnswering (from AyushPJ)
author: John Snow Labs
name: roberta_qa_ai_club_inductions_21_nlp_roBERTa_base_squad_v2
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ai-club-inductions-21-nlp-roBERTa-base-squad-v2` is an English model originally trained by `AyushPJ`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ai_club_inductions_21_nlp_roBERTa_base_squad_v2_en_4.0.0_3.0_1655727560873.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ai_club_inductions_21_nlp_roBERTa_base_squad_v2_en_4.0.0_3.0_1655727560873.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_ai_club_inductions_21_nlp_roBERTa_base_squad_v2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_ai_club_inductions_21_nlp_roBERTa_base_squad_v2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.base_v2.by_AyushPJ").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_ai_club_inductions_21_nlp_roBERTa_base_squad_v2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|465.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AyushPJ/ai-club-inductions-21-nlp-roBERTa-base-squad-v2
---
layout: model
title: Legal Listing Clause Binary Classifier
author: John Snow Labs
name: legclf_listing_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `listing` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only individual sentences rather than the whole text, so it is better to skip it unless you want to do Binary Classification at sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
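The first technique, paragraph splitting by multiline, can be sketched in plain Python; the regex and the helper name `split_paragraphs` below are illustrative only, not part of Spark NLP:

```python
import re

def split_paragraphs(text: str) -> list[str]:
    # Split on one or more blank lines and drop empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("SECTION 1. Listing.\nShares shall be listed on the NYSE.\n\n"
       "SECTION 2. Governing Law.\nThis Agreement is governed by Delaware law.")
paragraphs = split_paragraphs(doc)
# Each paragraph is now a separate clause candidate for the classifier.
```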
Take into consideration that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `listing`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_listing_clause_en_1.0.0_3.2_1660123698313.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_listing_clause_en_1.0.0_3.2_1660123698313.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
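This card ships without a code snippet. A minimal usage sketch, following the pattern of the other Legal NLP classifier cards, could look like the block below; the sentence-embeddings model name `sent_bert_base_cased` and the use of `ClassifierDLModel` are assumptions, so check the Models Hub entry for the exact stages (running this also requires a Spark session and a Legal NLP license):

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = ClassifierDLModel.pretrained("legclf_listing_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["PUT YOUR LEGAL TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```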
## Results
```bash
+---------+
|   result|
+---------+
|[listing]|
|  [other]|
|  [other]|
|[listing]|
+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_listing_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
listing 1.00 0.97 0.99 38
other 0.99 1.00 0.99 92
accuracy - - 0.99 130
macro-avg 0.99 0.99 0.99 130
weighted-avg 0.99 0.99 0.99 130
```
---
layout: model
title: English Named Entity Recognition (from lucifermorninstar011)
author: John Snow Labs
name: distilbert_ner_autotrain_lucifer_morningstar_job_859227344
date: 2022-05-16
tags: [distilbert, ner, token_classification, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-lucifer_morningstar_job-859227344` is an English model originally trained by `lucifermorninstar011`.
## Predicted Entities
`Job`, `OOV`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_autotrain_lucifer_morningstar_job_859227344_en_3.4.2_3.0_1652721635851.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_autotrain_lucifer_morningstar_job_859227344_en_3.4.2_3.0_1652721635851.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_autotrain_lucifer_morningstar_job_859227344","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_autotrain_lucifer_morningstar_job_859227344","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_ner_autotrain_lucifer_morningstar_job_859227344|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/lucifermorninstar011/autotrain-lucifer_morningstar_job-859227344
---
layout: model
title: Spanish DistilBertForQuestionAnswering model (from CenIA) MLQA
author: John Snow Labs
name: distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_mlqa
date: 2022-06-08
tags: [es, open_source, distilbert, question_answering]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distillbert-base-spanish-uncased-finetuned-qa-mlqa` is a Spanish model originally trained by `CenIA`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_mlqa_es_4.0.0_3.0_1654728089558.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_mlqa_es_4.0.0_3.0_1654728089558.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_mlqa","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["¿Cuál es mi nombre?", "Mi nombre es Clara y vivo en Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_mlqa","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("¿Cuál es mi nombre?", "Mi nombre es Clara y vivo en Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.mlqa.distil_bert.base_uncased").predict("""¿Cuál es mi nombre?|||Mi nombre es Clara y vivo en Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_mlqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|250.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/CenIA/distillbert-base-spanish-uncased-finetuned-qa-mlqa
---
layout: model
title: English Bert Embeddings Cased model (from aditeyabaral)
author: John Snow Labs
name: bert_embeddings_carlbert_webex_mlm_spatial
date: 2023-02-22
tags: [open_source, bert, bert_embeddings, bertformaskedlm, en, tensorflow]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `carlbert-webex-mlm-spatial` is an English model originally trained by `aditeyabaral`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_carlbert_webex_mlm_spatial_en_4.3.0_3.0_1677087512961.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_carlbert_webex_mlm_spatial_en_4.3.0_3.0_1677087512961.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_carlbert_webex_mlm_spatial","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_carlbert_webex_mlm_spatial","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_carlbert_webex_mlm_spatial|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|406.6 MB|
|Case sensitive:|true|
## References
https://huggingface.co/aditeyabaral/carlbert-webex-mlm-spatial
---
layout: model
title: Stopwords Remover for Swedish language (386 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, sv, open_source]
task: Stop Words Removal
language: sv
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_sv_3.4.1_3.0_1646672982973.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_sv_3.4.1_3.0_1646672982973.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","sv") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Du är inte bättre än jag"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","sv")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("Du är inte bättre än jag").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("sv.stopwords").predict("""Du är inte bättre än jag""")
```
## Results
```bash
+------+
|result|
+------+
|[är] |
+------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|sv|
|Size:|2.5 KB|
---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC4CHEMD_Chem_Original_BioBERT_512
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Original-BioBERT-512` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities
`Chemical`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_BioBERT_512_en_4.0.0_3.0_1657108546349.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_BioBERT_512_en_4.0.0_3.0_1657108546349.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_BioBERT_512","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_BioBERT_512","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_BC4CHEMD_Chem_Original_BioBERT_512|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|403.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Original-BioBERT-512
---
layout: model
title: English image_classifier_vit_trainer_rare_puppers ViTForImageClassification from nateraw
author: John Snow Labs
name: image_classifier_vit_trainer_rare_puppers
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_trainer_rare_puppers` is an English model originally trained by nateraw.
## Predicted Entities
`corgi`, `samoyed`, `shiba inu`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_trainer_rare_puppers_en_4.1.0_3.0_1660169700721.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_trainer_rare_puppers_en_4.1.0_3.0_1660169700721.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_trainer_rare_puppers", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_trainer_rare_puppers", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_trainer_rare_puppers|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Pipeline to Detect PHI (Deidentification)
author: John Snow Labs
name: ner_deid_large_pipeline
date: 2023-03-13
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_deid_large](https://nlp.johnsnowlabs.com/2021/03/31/ner_deid_large_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_pipeline_en_4.3.0_3.2_1678736100712.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_pipeline_en_4.3.0_3.2_1678736100712.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_deid_large_pipeline", "en", "clinical/models")
text = '''HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_deid_large_pipeline", "en", "clinical/models")
val text = "HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.deid_large.pipeline").predict("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.""")
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:----------------|--------:|------:|:------------|-------------:|
| 0 | Smith | 32 | 36 | NAME | 0.9998 |
| 1 | VA Hospital | 184 | 194 | LOCATION | 0.68335 |
| 2 | Day Hospital | 258 | 269 | LOCATION | 0.7763 |
| 3 | 02/04/2003 | 341 | 350 | DATE | 1 |
| 4 | Smith | 374 | 378 | NAME | 0.9993 |
| 5 | Day Hospital | 397 | 408 | LOCATION | 0.7522 |
| 6 | Smith | 782 | 786 | NAME | 0.9998 |
| 7 | Smith | 1131 | 1135 | NAME | 0.9997 |
| 8 | 7 Ardmore Tower | 1153 | 1167 | LOCATION | 0.739867 |
| 9 | Hart | 1221 | 1224 | NAME | 0.9995 |
| 10 | Smith | 1231 | 1235 | NAME | 0.9998 |
| 11 | 02/07/2003 | 1329 | 1338 | DATE | 1 |
```
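The extracted chunks can drive de-identification downstream. A minimal, stdlib-only sketch (independent of Spark NLP, with a made-up note and offsets) that replaces each detected span with its label, using inclusive `begin`/`end` character offsets like those in the table above:

```python
def mask_phi(text, chunks):
    """Replace each (begin, end, label) span with <LABEL>.

    `chunks` uses inclusive begin/end character offsets, as in the
    pipeline output above. Spans are applied right-to-left so earlier
    offsets stay valid after each replacement.
    """
    for begin, end, label in sorted(chunks, reverse=True):
        text = text[:begin] + f"<{label}>" + text[end + 1:]
    return text

note = "Mr. Smith was seen on 02/04/2003 at the VA Hospital."
chunks = [(4, 8, "NAME"), (22, 31, "DATE"), (40, 50, "LOCATION")]
print(mask_phi(note, chunks))  # Mr. <NAME> was seen on <DATE> at the <LOCATION>.
```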
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_large_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: English RoBERTa Embeddings (Sampling strategy 'sim select')
author: John Snow Labs
name: roberta_embeddings_distilroberta_base_climate_s
date: 2022-04-14
tags: [roberta, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilroberta-base-climate-s` is an English model originally trained by `climatebert`.
Sampling strategy `s`: as explained in the authors' paper [here](https://arxiv.org/pdf/2110.12010.pdf), `s` stands for "sim select", meaning that the 70% most similar sentences of one of the corpora were kept and the rest discarded.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_climate_s_en_3.4.2_3.0_1649946847931.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_climate_s_en_3.4.2_3.0_1649946847931.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base_climate_s","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base_climate_s","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
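Each row of the `embeddings` column holds one vector per token. A common downstream step is comparing tokens by cosine similarity; a stdlib-only sketch with illustrative, made-up 4-dimensional vectors (the real embeddings are much larger):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors standing in for token embeddings.
climate = [0.9, 0.1, 0.3, 0.2]
weather = [0.8, 0.2, 0.4, 0.1]
invoice = [0.1, 0.9, 0.0, 0.7]

print(round(cosine(climate, weather), 3))  # related tokens score closer to 1.0
print(round(cosine(climate, invoice), 3))  # unrelated tokens score lower
```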
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_distilroberta_base_climate_s|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|310.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/climatebert/distilroberta-base-climate-s
- https://arxiv.org/abs/2110.12010
---
layout: model
title: Stop Words Cleaner for Latvian
author: John Snow Labs
name: stopwords_lv
date: 2020-07-14 19:03:00 +0800
task: Stop Words Removal
language: lv
edition: Spark NLP 2.5.4
spark_version: 2.4
tags: [stopwords, lv]
supported: true
annotator: StopWordsCleaner
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_lv_lv_2.5.4_2.4_1594742439893.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_lv_lv_2.5.4_2.4_1594742439893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
stop_words = StopWordsCleaner.pretrained("stopwords_lv", "lv") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Džons Snovs ir ne tikai ziemeļu karalis, bet arī angļu ārsts un anestēzijas un medicīniskās higiēnas attīstības līderis.")
```
```scala
...
val stopWords = StopWordsCleaner.pretrained("stopwords_lv", "lv")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords))
val data = Seq("Džons Snovs ir ne tikai ziemeļu karalis, bet arī angļu ārsts un anestēzijas un medicīniskās higiēnas attīstības līderis.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Džons Snovs ir ne tikai ziemeļu karalis, bet arī angļu ārsts un anestēzijas un medicīniskās higiēnas attīstības līderis."""]
stopword_df = nlu.load('lv.stopwords').predict(text)
stopword_df[['cleanTokens']]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=4, result='Džons', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=6, end=10, result='Snovs', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=24, end=30, result='ziemeļu', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=32, end=38, result='karalis', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=39, end=39, result=',', metadata={'sentence': '0'}),
...]
```
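The cleaner's behavior amounts to filtering tokens against a fixed stop-word list. A stdlib-only sketch using a small, illustrative subset of Latvian stop words (the pretrained model ships with the complete list):

```python
# A few Latvian stop words for illustration only; the pretrained model
# bundles the full list.
STOP_WORDS = {"ir", "ne", "tikai", "bet", "arī", "un"}

def clean_tokens(tokens, stop_words=STOP_WORDS, case_sensitive=False):
    """Drop tokens found in the stop-word list (case-insensitive by default,
    matching the 'Case sensitive: false' setting in the table below)."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    lowered = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in lowered]

tokens = ["Džons", "Snovs", "ir", "ne", "tikai", "ziemeļu", "karalis"]
print(clean_tokens(tokens))
```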
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_lv|
|Type:|stopwords|
|Compatibility:|Spark NLP 2.5.4+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|lv|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)
---
layout: model
title: French RobertaForQuestionAnswering (from Gantenbein)
author: John Snow Labs
name: roberta_qa_ADDI_FR_XLM_R
date: 2022-06-20
tags: [open_source, question_answering, roberta]
task: Question Answering
language: fr
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ADDI-FR-XLM-R` is a French model originally trained by `Gantenbein`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_FR_XLM_R_fr_4.0.0_3.0_1655726482123.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_FR_XLM_R_fr_4.0.0_3.0_1655726482123.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_ADDI_FR_XLM_R","fr") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_ADDI_FR_XLM_R","fr")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("fr.answer_question.xlm_roberta.fr_tuned.by_Gantenbein").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
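Under the hood, extractive QA models score every token position as a possible answer start and end; the predicted answer is the best-scoring valid span. A stdlib-only sketch of that decoding step, with made-up scores over the example context:

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (start, end) maximizing start_scores[s] + end_scores[e],
    subject to s <= e and a maximum span length."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 4.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]  # illustrative scores
end   = [0.0, 0.1, 0.2, 3.5, 0.1, 0.0, 0.0, 0.0, 0.5, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```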
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_ADDI_FR_XLM_R|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|fr|
|Size:|422.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Gantenbein/ADDI-FR-XLM-R
---
layout: model
title: Relation Extraction Between Body Parts and Procedures
author: John Snow Labs
name: redl_bodypart_procedure_test_biobert
date: 2023-01-14
tags: [relation_extraction, en, clinical, dl, licensed, tensorflow]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Relation extraction between body part entities (e.g., `Internal_organ_or_component`, `External_body_part_or_region`) and procedure/test entities. `1`: the body part and the test/procedure are related to each other. `0`: they are not related.
## Predicted Entities
`1`, `0`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_procedure_test_biobert_en_4.2.4_3.0_1673714088228.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_procedure_test_biobert_en_4.2.4_3.0_1673714088228.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
words_embedder = WordEmbeddingsModel() \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"]) \
.setOutputCol("embeddings")
ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_converter = NerConverterInternal() \
.setInputCols(["sentences", "tokens", "ner_tags"]) \
.setOutputCol("ner_chunks")
dependency_parser = DependencyParserModel() \
.pretrained("dependency_conllu", "en") \
.setInputCols(["sentences", "pos_tags", "tokens"]) \
.setOutputCol("dependencies")
# Set a filter on pairs of named entities which will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter() \
.setInputCols(["ner_chunks", "dependencies"])\
.setMaxSyntacticDistance(10)\
.setOutputCol("re_ner_chunks")\
.setRelationPairs(["external_body_part_or_region-test"])
# The model was trained on sentence-level data.
# It can also be trained on document-level relations; in that case, use "document" instead of "sentences" as input when predicting.
re_model = RelationExtractionDLModel()\
.pretrained('redl_bodypart_procedure_test_biobert', 'en', "clinical/models") \
.setPredictionThreshold(0.5)\
.setInputCols(["re_ner_chunks", "sentences"]) \
.setOutputCol("relations")
pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model])
data = spark.createDataFrame([['''TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.''']]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentences"))
.setOutputCol("tokens")
val pos_tagger = PerceptronModel()
.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val words_embedder = WordEmbeddingsModel()
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel()
.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
// Set a filter on pairs of named entities which will be treated as relation candidates
val re_ner_chunk_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunks", "dependencies"))
.setMaxSyntacticDistance(10)
.setOutputCol("re_ner_chunks")
.setRelationPairs("external_body_part_or_region-test")
// The model was trained on sentence-level data.
// It can also be trained on document-level relations; in that case, use "document" instead of "sentences" as input when predicting.
val re_model = RelationExtractionDLModel()
.pretrained("redl_bodypart_procedure_test_biobert", "en", "clinical/models")
.setPredictionThreshold(0.5)
.setInputCols(Array("re_ner_chunks", "sentences"))
.setOutputCol("relations")
val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model))
val data = Seq("""TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.bodypart.procedure").predict("""TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.""")
```
## Results
```bash
| | relation | entity1 | chunk1 | entity2 | chunk2 | confidence |
|---:|-----------:|:-----------------------------|:---------|:----------|:--------------------|-------------:|
| 0 | 1 | External_body_part_or_region | chest | Test | portable ultrasound | 0.99953 |
```
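`RENerChunksFilter` keeps only entity pairs whose labels match `setRelationPairs` and whose syntactic distance is within the configured limit. A simplified, stdlib-only sketch of that pair-filtering idea, using token distance in place of the true dependency-tree distance:

```python
from itertools import combinations

def candidate_pairs(chunks, allowed_pairs, max_distance=10):
    """Return entity pairs whose label combination is allowed and whose
    token distance is within max_distance.

    chunks: list of (label, token_index) tuples.
    allowed_pairs: set of "label1-label2" strings (lowercase).
    """
    out = []
    for (l1, i1), (l2, i2) in combinations(chunks, 2):
        key, rev = f"{l1.lower()}-{l2.lower()}", f"{l2.lower()}-{l1.lower()}"
        if (key in allowed_pairs or rev in allowed_pairs) and abs(i1 - i2) <= max_distance:
            out.append((l1, l2))
    return out

# Illustrative chunk positions from the example sentence above.
chunks = [("External_body_part_or_region", 17), ("Test", 21)]
print(candidate_pairs(chunks, {"external_body_part_or_region-test"}))
```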
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|redl_bodypart_procedure_test_biobert|
|Compatibility:|Healthcare NLP 4.2.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|401.7 MB|
## References
Trained on a custom internal dataset.
## Benchmarking
```bash
label Recall Precision F1 Support
0 0.338 0.472 0.394 325
1 0.904 0.843 0.872 1275
Avg. 0.621 0.657 0.633 -
```
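The F1 values in the table follow directly from precision and recall; a quick check:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Per-class scores from the benchmarking table above.
print(round(f1(0.472, 0.338), 3))  # class 0
print(round(f1(0.843, 0.904), 3))  # class 1
```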
---
layout: model
title: Translate Morisyen to English Pipeline
author: John Snow Labs
name: translate_mfe_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, mfe, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `mfe`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_mfe_en_xx_2.7.0_2.4_1609688095194.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_mfe_en_xx_2.7.0_2.4_1609688095194.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_mfe_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_mfe_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.mfe.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_mfe_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Multilingual XLMRobertaForTokenClassification Base Cased model (from haesun)
author: John Snow Labs
name: xlmroberta_ner_haesun_base_finetuned_panx_all
date: 2022-08-13
tags: [xx, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: xx
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `haesun`.
## Predicted Entities
`ORG`, `LOC`, `PER`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_haesun_base_finetuned_panx_all_xx_4.1.0_3.0_1660428351266.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_haesun_base_finetuned_panx_all_xx_4.1.0_3.0_1660428351266.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_haesun_base_finetuned_panx_all","xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_haesun_base_finetuned_panx_all","xx")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
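The `NerConverter` stage groups the token-level IOB tags emitted by the classifier into entity chunks. A stdlib-only sketch of that grouping logic, with a made-up example:

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB2 tags (B-X, I-X, O) into (text, label) chunks."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O", or an I- tag that doesn't continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["John", "Snow", "works", "at", "John", "Snow", "Labs"]
tags   = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG"]
print(iob_to_chunks(tokens, tags))
```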
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_haesun_base_finetuned_panx_all|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|xx|
|Size:|862.6 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/haesun/xlm-roberta-base-finetuned-panx-all
---
layout: model
title: OCR Pipeline with REST API
author: John Snow Labs
name: ocr_restapi
date: 2023-01-03
tags: [en, licensed, ocr, RestApi]
task: Ocr RestApi
language: en
nav_key: models
edition: Visual NLP 4.0.0
spark_version: 3.2.1
supported: true
annotator: OcrRestApi
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
REST API pipeline implementation for the OCR task, using Tesseract models. Tesseract is an Optical Character Recognition (OCR) engine developed by Google. It is an open-source tool that can be used to recognize text in images and convert it into machine-readable text. The engine is based on a neural network architecture and uses machine learning algorithms to improve its accuracy over time.
Tesseract has been trained on a variety of datasets to improve its recognition capabilities. These datasets include images of text in various languages and scripts, as well as images with different font styles, sizes, and orientations. The training process involves feeding the engine with a large number of images and their corresponding text, allowing the engine to learn the patterns and characteristics of different text styles. One of the most important datasets used in training Tesseract is the UNLV dataset, which contains over 400,000 images of text in different languages, scripts, and font styles. This dataset is widely used in the OCR community and has been instrumental in improving the accuracy of Tesseract. Other datasets that have been used in training Tesseract include the ICDAR dataset, the IIIT-HWS dataset, and the RRC-GV-WS dataset.
In addition to these datasets, Tesseract also uses a technique called adaptive training, where the engine can continuously improve its recognition capabilities by learning from new images and text. This allows Tesseract to adapt to new text styles and languages, and improve its overall accuracy.
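A REST client typically sends the image base64-encoded in a JSON body. The sketch below builds such a payload with the standard library only; the `"image"` field name and the payload shape are illustrative assumptions, so check the workshop notebook linked below for the actual request schema served by this pipeline:

```python
import base64
import json

def build_ocr_request(image_bytes):
    """Build a JSON payload with the image base64-encoded.

    The "image" field name is a hypothetical example, not the
    pipeline's documented schema.
    """
    return json.dumps({"image": base64.b64encode(image_bytes).decode("ascii")})

payload = build_ocr_request(b"\x89PNG...fake image bytes")
decoded = base64.b64decode(json.loads(payload)["image"])
print(decoded[:4])  # the original bytes round-trip intact
```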
## Predicted Entities
{:.btn-box}
[Open in Colab](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/tutorials/Certification_Trainings/6.2.SparkOcrRestApi.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
## How to use
## Example
### Input:

## Output text
```bash
Response:
STARBUCKS Store #19208
11902 Euclid Avenue
Cleveland, OH (216) 229-U749
CHK 664250
12/07/2014 06:43 PM
112003. Drawers 2. Reg: 2
¥t Pep Mocha 4.5
Sbux Card 495
AMXARKERARANG 228
Subtotal $4.95
Total $4.95
Change Cue BO LOO
- Check Closed ~
"49/07/2014 06:43 py
oBUX Card «3228 New Balance: 37.45
Card is registertd
```
## Model Information
{:.table-model}
|---|---|
|Model Name:|ocr_restapi|
|Compatibility:|Visual NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
---
layout: model
title: Pipeline to Extract Mentions of Response to Cancer Treatment
author: John Snow Labs
name: ner_oncology_response_to_treatment_pipeline
date: 2023-03-09
tags: [licensed, clinical, en, oncology, ner, treatment]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_oncology_response_to_treatment](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_response_to_treatment_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_response_to_treatment_pipeline_en_4.3.0_3.2_1678349824229.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_response_to_treatment_pipeline_en_4.3.0_3.2_1678349824229.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_oncology_response_to_treatment_pipeline", "en", "clinical/models")
text = '''She completed her first-line therapy, but some months later there was recurrence of the breast cancer.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_oncology_response_to_treatment_pipeline", "en", "clinical/models")
val text = "She completed her first-line therapy, but some months later there was recurrence of the breast cancer."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:-------------|--------:|------:|:----------------------|-------------:|
| 0 | recurrence | 70 | 79 | Response_To_Treatment | 0.9767 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_response_to_treatment_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from skandaonsolve)
author: John Snow Labs
name: roberta_qa_finetuned_timeentities
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-finetuned-timeentities` is an English model originally trained by `skandaonsolve`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_timeentities_en_4.3.0_3.0_1674220613032.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_timeentities_en_4.3.0_3.0_1674220613032.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_timeentities","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_timeentities","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_finetuned_timeentities|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/skandaonsolve/roberta-finetuned-timeentities
---
layout: model
title: English BertForQuestionAnswering model (from mrm8488)
author: John Snow Labs
name: bert_qa_bert_tiny_2_finetuned_squadv2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-2-finetuned-squadv2` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_tiny_2_finetuned_squadv2_en_4.0.0_3.0_1654184806087.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_tiny_2_finetuned_squadv2_en_4.0.0_3.0_1654184806087.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_tiny_2_finetuned_squadv2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_tiny_2_finetuned_squadv2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.bert.tiny_v2.by_mrm8488").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_tiny_2_finetuned_squadv2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|19.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/mrm8488/bert-tiny-2-finetuned-squadv2
---
layout: model
title: Part of Speech for Icelandic
author: John Snow Labs
name: pos_icepahc
date: 2021-03-23
tags: [pos, open_source, is]
supported: true
task: Part of Speech Tagging
language: is
edition: Spark NLP 2.7.5
spark_version: 2.4
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron` architecture.
## Predicted Entities
- ADJ
- ADP
- ADV
- AUX
- CCONJ
- DET
- NOUN
- NUM
- PART
- PRON
- PROPN
- PUNCT
- VERB
- X
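The `averaged perceptron` architecture mentioned above keeps a running sum of its weights during training and predicts with their average, which smooths out late-training oscillations. A minimal sketch of the idea, with a hypothetical toy feature set (current word, previous tag, word suffix) and toy data — not the actual Spark NLP implementation:

```python
from collections import defaultdict

def train_averaged_perceptron(sentences, epochs=3):
    """Train a tiny averaged-perceptron POS tagger.
    `sentences` is a list of sentences, each a list of (word, tag) pairs."""
    tags = sorted({t for sent in sentences for _, t in sent})
    weights = defaultdict(float)   # (feature, tag) -> current weight
    totals = defaultdict(float)    # running sums used for averaging
    step = 0
    for _ in range(epochs):
        for sent in sentences:
            prev = "<s>"
            for word, gold in sent:
                feats = [("w", word), ("prev", prev), ("suf", word[-2:])]
                # score every tag with the current weights, predict the best
                pred = max(tags, key=lambda t: sum(weights[(f, t)] for f in feats))
                if pred != gold:
                    for f in feats:
                        weights[(f, gold)] += 1.0
                        weights[(f, pred)] -= 1.0
                # accumulate every weight for the final average
                for key, w in weights.items():
                    totals[key] += w
                step += 1
                prev = gold
    return tags, {k: totals[k] / step for k in totals}

def tag(words, tags, avg_weights):
    """Tag a sentence greedily with the averaged weights."""
    out, prev = [], "<s>"
    for word in words:
        feats = [("w", word), ("prev", prev), ("suf", word[-2:])]
        best = max(tags, key=lambda t: sum(avg_weights.get((f, t), 0.0) for f in feats))
        out.append(best)
        prev = best
    return out
```

The averaging step is what distinguishes this from a plain perceptron: a weight that flips sign in the last few updates contributes little to the average, so the final tagger is more stable on held-out text.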
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_icepahc_is_2.7.5_2.4_1616509019245.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_icepahc_is_2.7.5_2.4_1616509019245.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_icepahc", "is")\
.setInputCols(["document", "token"])\
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['Númerið blikkaði á skjánum eins og einmana vekjaraklukka um nótt á níundu hæð í gamalli blokk í austurbæ Reykjavíkur .']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_icepahc", "is")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector,tokenizer, pos))
val data = Seq("Númerið blikkaði á skjánum eins og einmana vekjaraklukka um nótt á níundu hæð í gamalli blokk í austurbæ Reykjavíkur .").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Númerið blikkaði á skjánum eins og einmana vekjaraklukka um nótt á níundu hæð í gamalli blokk í austurbæ Reykjavíkur ."""]
token_df = nlu.load('is.pos.icepahc').predict(text)
token_df
```
## Results
```bash
+----------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
|text |result |
+----------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
|Númerið blikkaði á skjánum eins og einmana vekjaraklukka um nótt á níundu hæð í gamalli blokk í austurbæ Reykjavíkur .|[NOUN, VERB, ADP, NOUN, ADV, ADP, ADJ, NOUN, ADP, NOUN, ADP, ADJ, NOUN, ADP, ADJ, NOUN, ADP, PROPN, PROPN, PUNCT]|
+----------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_icepahc|
|Compatibility:|Spark NLP 2.7.5+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[pos]|
|Language:|is|
## Data Source
The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) data set.
## Benchmarking
```bash
| | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| ADJ | 0.81 | 0.74 | 0.78 | 5906 |
| ADP | 0.95 | 0.96 | 0.96 | 15548 |
| ADV | 0.90 | 0.90 | 0.90 | 10631 |
| AUX | 0.92 | 0.93 | 0.92 | 7416 |
| CCONJ | 0.96 | 0.97 | 0.96 | 8437 |
| DET | 0.89 | 0.87 | 0.88 | 7476 |
| INTJ | 0.95 | 0.77 | 0.85 | 131 |
| NOUN | 0.90 | 0.92 | 0.91 | 20726 |
| NUM | 0.75 | 0.83 | 0.79 | 655 |
| PART | 0.96 | 0.96 | 0.96 | 1703 |
| PRON | 0.94 | 0.96 | 0.95 | 16852 |
| PROPN | 0.89 | 0.89 | 0.89 | 4444 |
| PUNCT | 0.98 | 0.98 | 0.98 | 16434 |
| SCONJ | 0.94 | 0.94 | 0.94 | 5663 |
| VERB | 0.92 | 0.90 | 0.91 | 17329 |
| X | 0.60 | 0.30 | 0.40 | 346 |
| accuracy | | | 0.92 | 139697 |
| macro avg | 0.89 | 0.86 | 0.87 | 139697 |
| weighted avg | 0.92 | 0.92 | 0.92 | 139697 |
```
---
layout: model
title: Part of Speech for Marathi
author: John Snow Labs
name: pos_ud_ufal
date: 2021-03-09
tags: [part_of_speech, open_source, marathi, pos_ud_ufal, mr]
task: Part of Speech Tagging
language: mr
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron` architecture.
## Predicted Entities
- DET
- AUX
- NOUN
- PUNCT
- PRON
- ADJ
- CCONJ
- ADV
- VERB
- SCONJ
- NUM
- ADP
- INTJ
- PROPN
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_ufal_mr_3.0.0_3.0_1615292224912.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_ufal_mr_3.0.0_3.0_1615292224912.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ud_ufal", "mr") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['जॉन हिम लॅब्समधून हॅलो! ']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_ufal", "mr")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("जॉन हिम लॅब्समधून हॅलो! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""जॉन हिम लॅब्समधून हॅलो! """]
token_df = nlu.load('mr.pos').predict(text)
token_df
```
## Results
```bash
token pos
0 जॉन PROPN
1 हिम NOUN
2 लॅब्समधून ADJ
3 हॅलो VERB
4 ! PUNCT
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_ufal|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|mr|
---
layout: model
title: Japanese Word Segmentation
author: John Snow Labs
name: wordseg_gsd_ud
date: 2021-01-03
task: Word Segmentation
language: ja
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [word_segmentation, ja, open_source]
supported: true
annotator: WordSegmenterModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
WordSegmenterModel (WSM) is based on a maximum entropy probability model to detect word boundaries in Japanese text. Japanese text is written without white space between the words, and a computer-based application cannot know _a priori_ which sequence of characters forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.
References:
- Xue, Nianwen. "Chinese word segmentation as character tagging." International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing.
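The character-tagging formulation from Xue (2003) labels each character with its position inside a word; decoding those labels back into tokens is then mechanical. A minimal sketch using a hypothetical B/M/E/S tag set (B = begin, M = middle, E = end of a multi-character word, S = single-character word) — illustrative only, not the model's internal format:

```python
def tags_to_words(chars, tags):
    """Decode a character-level B/M/E/S tag sequence into words."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            if buf:             # flush a dangling partial word defensively
                words.append(buf)
                buf = ""
            words.append(ch)
        elif tag == "B":
            if buf:
                words.append(buf)
            buf = ch
        elif tag == "M":
            buf += ch
        else:                   # "E": close the current word
            words.append(buf + ch)
            buf = ""
    if buf:
        words.append(buf)
    return words
```

The model's job is thus reduced to predicting one of four labels per character; the segmentation itself falls out of the decode step above.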
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_gsd_ud_ja_2.7.0_2.4_1609692613721.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_gsd_ud_ja_2.7.0_2.4_1609692613721.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline as a substitute for the Tokenizer stage.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
word_segmenter = WordSegmenterModel.pretrained('wordseg_gsd_ud', 'ja')\
.setInputCols("document")\
.setOutputCol("token")
pipeline = Pipeline(stages=[
document_assembler,
word_segmenter
])
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
example = spark.createDataFrame([['清代は湖北省が置かれ、そのまま現代の行政区分になっている。']], ["text"])
result = model.transform(example)
```
```scala
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja")
.setInputCols("document")
.setOutputCol("token")
val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter))
val data = Seq("清代は湖北省が置かれ、そのまま現代の行政区分になっている。").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""清代は湖北省が置かれ、そのまま現代の行政区分になっている。"""]
token_df = nlu.load('ja.segment_words').predict(text, output_level='token')
token_df
```
## Results
```bash
+----------------------------------------------------------+------------------------------------------------------------------------------------------------+
|text |result |
+----------------------------------------------------------+------------------------------------------------------------------------------------------------+
|清代は湖北省が置かれ、そのまま現代の行政区分になっている。|[清代, は, 湖北, 省, が, 置か, れ, 、, その, まま, 現代, の, 行政, 区分, に, なっ, て, いる, 。]|
+----------------------------------------------------------+------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|wordseg_gsd_ud|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[token]|
|Language:|ja|
## Data Source
We trained this model on the [Universal Dependencies](https://universaldependencies.org) data set from Google (GSD-UD).
> Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC-2018.
## Benchmarking
```bash
| Model | precision | recall | f1-score |
|---------------|--------------|--------------|--------------|
| JA_UD-GSD     | 0.7687       | 0.8048       | 0.7863       |
```
---
layout: model
title: English BertForTokenClassification Cased model (from test123)
author: John Snow Labs
name: bert_token_classifier_autonlp_ingredient_pseudo_label_training_ner_29576765
date: 2022-11-30
tags: [en, open_source, bert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-ingredient_pseudo_label_training_ner-29576765` is an English model originally trained by `test123`.
## Predicted Entities
`I`, `B`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autonlp_ingredient_pseudo_label_training_ner_29576765_en_4.2.4_3.0_1669814198805.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autonlp_ingredient_pseudo_label_training_ner_29576765_en_4.2.4_3.0_1669814198805.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autonlp_ingredient_pseudo_label_training_ner_29576765","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autonlp_ingredient_pseudo_label_training_ner_29576765","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_autonlp_ingredient_pseudo_label_training_ner_29576765|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/test123/autonlp-ingredient_pseudo_label_training_ner-29576765
---
layout: model
title: Legal NER for NDA (Confidential Information-Permissions)
author: John Snow Labs
name: legner_nda_confidential_information_permissions
date: 2023-04-06
tags: [en, licensed, legal, ner, nda, permission]
task: Named Entity Recognition
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is an NER model, intended to be run **only** after detecting the `USE_OF_CONF_INFO` clause with a proper classifier (use legmulticlf_mnda_sections_paragraph_other for that purpose). It extracts the following entities: `PERMISSION`, `PERMISSION_SUBJECT`, `PERMISSION_OBJECT`, and `PERMISSION_IND_OBJECT`.
## Predicted Entities
`PERMISSION`, `PERMISSION_SUBJECT`, `PERMISSION_OBJECT`, `PERMISSION_IND_OBJECT`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_confidential_information_permissions_en_1.0.0_3.0_1680814300223.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_confidential_information_permissions_en_1.0.0_3.0_1680814300223.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)\
.setCaseSensitive(True)
ner_model = legal.NerModel.pretrained("legner_nda_confidential_information_permissions", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""The interested party may disclose the information to its financing sources and potential financing sources provided that such financing sources are bound by the terms of this non-disclosure agreement and agree to keep the information confidential."""]
result = model.transform(spark.createDataFrame([text]).toDF("text"))
```
## Results
```bash
+---------------------------+---------------------+
|chunk |ner_label |
+---------------------------+---------------------+
|interested party |PERMISSION_SUBJECT |
|disclose |PERMISSION |
|information |PERMISSION_OBJECT |
|financing sources |PERMISSION_IND_OBJECT|
|potential financing sources|PERMISSION_IND_OBJECT|
+---------------------------+---------------------+
```
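The chunks in the table above are produced by the `ner_converter` stage, which groups token-level IOB labels (`B-PERMISSION`, `I-PERMISSION`, `O`, …) into entity spans. A minimal sketch of that grouping logic, with hypothetical tokens and labels (simplified — the real NerConverter also carries character offsets and metadata):

```python
def bio_to_chunks(tokens, labels):
    """Group token-level BIO labels into (chunk_text, entity_type) pairs."""
    chunks, cur_toks, cur_type = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if cur_toks:                       # close the previous chunk
                chunks.append((" ".join(cur_toks), cur_type))
            cur_toks, cur_type = [tok], lab[2:]
        elif lab.startswith("I-") and cur_type == lab[2:]:
            cur_toks.append(tok)               # continue the open chunk
        else:                                  # "O" or a stray I- tag
            if cur_toks:
                chunks.append((" ".join(cur_toks), cur_type))
            cur_toks, cur_type = [], None
    if cur_toks:
        chunks.append((" ".join(cur_toks), cur_type))
    return chunks
```

A `B-` label always opens a new chunk, so two adjacent entities of the same type (as with the two `PERMISSION_IND_OBJECT` chunks above) stay separate.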
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_nda_confidential_information_permissions|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|16.3 MB|
## References
In-house annotations on non-disclosure agreements.
## Benchmarking
```bash
label precision recall f1-score support
PERMISSION 1.00 1.00 1.00 9
PERMISSION_IND_OBJECT 1.00 0.67 0.80 9
PERMISSION_OBJECT 0.91 1.00 0.95 10
PERMISSION_SUBJECT 0.90 1.00 0.95 9
micro-avg 0.94 0.92 0.93 37
macro-avg 0.95 0.92 0.92 37
weighted-avg 0.95 0.92 0.93 37
```
---
layout: model
title: English RobertaForQuestionAnswering (from nlpconnect)
author: John Snow Labs
name: roberta_qa_dpr_nq_reader_roberta_base_v2
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dpr-nq-reader-roberta-base-v2` is an English model originally trained by `nlpconnect`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_dpr_nq_reader_roberta_base_v2_en_4.0.0_3.0_1655728533632.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_dpr_nq_reader_roberta_base_v2_en_4.0.0_3.0_1655728533632.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_dpr_nq_reader_roberta_base_v2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_dpr_nq_reader_roberta_base_v2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.roberta.base_v2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_dpr_nq_reader_roberta_base_v2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|466.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/nlpconnect/dpr-nq-reader-roberta-base-v2
---
layout: model
title: Translate Germanic languages to English Pipeline
author: John Snow Labs
name: translate_gem_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, gem, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `gem`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_gem_en_xx_2.7.0_2.4_1609691653097.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_gem_en_xx_2.7.0_2.4_1609691653097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_gem_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_gem_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.gem.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_gem_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForTokenClassification Base Uncased model (from Datasaur)
author: John Snow Labs
name: distilbert_token_classifier_base_uncased_finetuned_conll2003
date: 2023-03-14
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-conll2003` is an English model originally trained by `Datasaur`.
## Predicted Entities
`PER`, `ORG`, `MISC`, `LOC`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_finetuned_conll2003_en_4.3.1_3.0_1678783094748.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_finetuned_conll2003_en_4.3.1_3.0_1678783094748.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_finetuned_conll2003","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_finetuned_conll2003","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_base_uncased_finetuned_conll2003|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/Datasaur/distilbert-base-uncased-finetuned-conll2003
---
layout: model
title: English asr_xlsr_punctuation TFWav2Vec2ForCTC from boris
author: John Snow Labs
name: pipeline_asr_xlsr_punctuation
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_punctuation` is an English model originally trained by boris.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_xlsr_punctuation_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xlsr_punctuation_en_4.2.0_3.0_1664020787424.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xlsr_punctuation_en_4.2.0_3.0_1664020787424.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_xlsr_punctuation', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_xlsr_punctuation", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_xlsr_punctuation|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Part of Speech for Portuguese
author: John Snow Labs
name: pos_ud_bosque
date: 2020-05-03 12:54:00 +0800
task: Part of Speech Tagging
language: pt
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [pos, pt]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_bosque_pt_2.5.0_2.4_1588499443093.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_bosque_pt_2.5.0_2.4_1588499443093.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
pos = PerceptronModel.pretrained("pos_ud_bosque", "pt") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica.")
```
```scala
...
val pos = PerceptronModel.pretrained("pos_ud_bosque", "pt")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica."""]
pos_df = nlu.load('pt.pos.ud_bosque').predict(text, output_level='token')
pos_df
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='pos', begin=0, end=3, result='ADV', metadata={'word': 'Além'}),
Row(annotatorType='pos', begin=5, end=6, result='ADP', metadata={'word': 'de'}),
Row(annotatorType='pos', begin=8, end=10, result='AUX', metadata={'word': 'ser'}),
Row(annotatorType='pos', begin=12, end=12, result='DET', metadata={'word': 'o'}),
Row(annotatorType='pos', begin=14, end=16, result='NOUN', metadata={'word': 'rei'}),
...]
```
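The rows above can be flattened into conventional `word/TAG` pairs with a few lines of post-processing. The dict shape below mirrors the printed rows and is an assumption of this sketch:

```python
# Illustrative post-processing of the pos annotations shown in Results:
# join each token's surface form with its predicted tag.
def to_tagged_pairs(rows):
    return " ".join(f"{r['metadata']['word']}/{r['result']}" for r in rows)

rows = [
    {"result": "ADV", "metadata": {"word": "Além"}},
    {"result": "ADP", "metadata": {"word": "de"}},
    {"result": "AUX", "metadata": {"word": "ser"}},
]
print(to_tagged_pairs(rows))  # Além/ADV de/ADP ser/AUX
```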
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_bosque|
|Type:|pos|
|Compatibility:|Spark NLP 2.5.0+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[pos]|
|Language:|pt|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: English RobertaForSequenceClassification Cased model (from lucianpopa)
author: John Snow Labs
name: roberta_classifier_autonlp_sst1_529214890
date: 2022-12-09
tags: [en, open_source, roberta, sequence_classification, classification]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-SST1-529214890` is an English model originally trained by `lucianpopa`.
## Predicted Entities
`1`, `0`, `4`, `3`, `2`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_autonlp_sst1_529214890_en_4.2.4_3.0_1670622060169.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_autonlp_sst1_529214890_en_4.2.4_3.0_1670622060169.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autonlp_sst1_529214890","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier])
data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autonlp_sst1_529214890","en")
.setInputCols(Array("document", "token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier))
val data = Seq("I love you!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_classifier_autonlp_sst1_529214890|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|435.0 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/lucianpopa/autonlp-SST1-529214890
---
layout: model
title: Smaller BERT Sentence Embeddings (L-10_H-128_A-2)
author: John Snow Labs
name: sent_small_bert_L10_128
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L10_128_en_2.6.0_2.4_1598350346103.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L10_128_en_2.6.0_2.4_1598350346103.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L10_128", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]]).toDF("text"))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L10_128", "en")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.small_bert_L10_128').predict(text, output_level='sentence')
embeddings_df
```
{:.h2_title}
## Results
```bash
en_embed_sentence_small_bert_L10_128_embeddings sentence
[-0.3761860430240631, -0.04432673007249832, 0.... I hate cancer
[-0.17762605845928192, -0.7492673397064209, -0... Antibiotics aren't painkiller
```
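A common next step with sentence embeddings is similarity scoring. A minimal cosine-similarity sketch on toy 3-dimensional vectors (the real model produces 128-dimensional vectors; the values here are made up):

```python
import math

# Cosine similarity between two embedding vectors: the dot product
# normalized by both vector magnitudes.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]), 3))  # 0.5
```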
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_small_bert_L10_128|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|en|
|Dimension:|128|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from [https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1)
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from Moussab)
author: John Snow Labs
name: roberta_qa_deepset_base_squad2_orkg_what_1e_4
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deepset-roberta-base-squad2-orkg-what-1e-4` is an English model originally trained by `Moussab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_what_1e_4_en_4.3.0_3.0_1674209722792.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_what_1e_4_en_4.3.0_3.0_1674209722792.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_what_1e_4","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_what_1e_4","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_deepset_base_squad2_orkg_what_1e_4|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Moussab/deepset-roberta-base-squad2-orkg-what-1e-4
---
layout: model
title: HCP Consult Classification Pipeline - Voice of the Patient
author: John Snow Labs
name: bert_sequence_classifier_vop_hcp_consult_pipeline
date: 2023-06-14
tags: [licensed, en, clinical, classification, vop]
task: Text Classification
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline includes the Medical Bert for Sequence Classification model to identify texts that mention an HCP consult. The pipeline is built on top of the [bert_sequence_classifier_vop_hcp_consult](https://nlp.johnsnowlabs.com/2023/06/13/bert_sequence_classifier_vop_hcp_consult_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_hcp_consult_pipeline_en_4.4.3_3.2_1686708308086.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_hcp_consult_pipeline_en_4.4.3_3.2_1686708308086.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_sequence_classifier_vop_hcp_consult_pipeline", "en", "clinical/models")
pipeline.annotate("My son has been to two doctors who gave him antibiotic drops but they also say the problem might related to allergies.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_sequence_classifier_vop_hcp_consult_pipeline", "en", "clinical/models")
val result = pipeline.annotate("My son has been to two doctors who gave him antibiotic drops but they also say the problem might related to allergies.")
```
## Results
```bash
| text | prediction |
|:-----------------------------------------------------------------------------------------------------------------------|:-----------------|
| My son has been to two doctors who gave him antibiotic drops but they also say the problem might related to allergies. | Consulted_By_HCP |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_vop_hcp_consult_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|406.4 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- MedicalBertForSequenceClassification
---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_8
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-256-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_8_en_4.0.0_3.0_1655732330256.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_8_en_4.0.0_3.0_1655732330256.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_8","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_8","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_256d_seed_8").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
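The NLU call above packs the question and context into one string separated by `|||`. A tiny helper that makes the convention explicit (an illustration of the input format, not part of the nlu API):

```python
# Build the single-string QA input nlu expects: question, separator, context.
def qa_input(question, context, sep="|||"):
    return f"{question}{sep}{context}"

print(qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
```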
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_8|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|426.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-8
---
layout: model
title: Pipeline to Detect diseases in Medical Text (biobert)
author: John Snow Labs
name: ner_diseases_biobert_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, disease, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_diseases_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_diseases_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diseases_biobert_pipeline_en_3.4.1_3.0_1647871907471.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diseases_biobert_pipeline_en_3.4.1_3.0_1647871907471.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_diseases_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("""Detection of various other intracellular signaling proteins is also described. Multiple autoimmune syndrome has been detected. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. She has Chikungunya virus disease story also. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""")
```
```scala
val pipeline = new PretrainedPipeline("ner_diseases_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("Detection of various other intracellular signaling proteins is also described. Multiple autoimmune syndrome has been detected. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. She has Chikungunya virus disease story also. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.diseases_biobert.pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Multiple autoimmune syndrome has been detected. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. She has Chikungunya virus disease story also. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""")
```
## Results
```bash
+-------------------------+---------+
|chunk |ner_label|
+-------------------------+---------+
|autoimmune syndrome |Disease |
|human T-cell leukemia |Disease |
|T-cell leukemia |Disease |
|Chikungunya virus disease|Disease |
+-------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_diseases_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.1 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverter
---
layout: model
title: Detect PHI for Generic Deidentification in Romanian (BERT)
author: John Snow Labs
name: ner_deid_generic_bert
date: 2022-07-06
tags: [deidentification, bert, phi, ner, generic, ro, licensed]
task: Named Entity Recognition
language: ro
edition: Healthcare NLP 3.5.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Deidentification NER (Romanian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It is trained with bert_base_cased embeddings and can detect 7 generic entities.
This NER model is trained with a combination of custom datasets with several data augmentation mechanisms.
## Predicted Entities
`AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_ro_3.5.0_3.0_1657112906624.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_ro_3.5.0_3.0_1657112906624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\
.setInputCols(["sentence","token"])\
.setOutputCol("word_embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models")\
.setInputCols(["sentence","token","word_embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner,
ner_converter])
text = """
Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""
data = spark.createDataFrame([[text]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")
.setInputCols(Array("sentence","token"))
.setOutputCol("word_embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models")
.setInputCols(Array("sentence","token","word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner,
ner_converter))
val text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""
val data = Seq(text).toDS.toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ro.med_ner.deid_generic_bert").predict("""
Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401""")
```
## Results
```bash
+----------------------------+---------+
|chunk |ner_label|
+----------------------------+---------+
|Spitalul Pentru Ochi de Deal|LOCATION |
|Drumul Oprea Nr |LOCATION |
|972 |LOCATION |
|Vaslui |LOCATION |
|737405 |LOCATION |
|+40(235)413773 |CONTACT |
|25 May 2022 |DATE |
|BUREAN MARIA |NAME |
|77 |AGE |
|Agota Evelyn Tımar |NAME |
|2450502264401 |ID |
+----------------------------+---------+
```
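A typical downstream use of these chunks is masking: replacing each detected span with its label to de-identify the text. A minimal sketch using chunks from the Results table (the replace-longest-first strategy is an assumption of this illustration; Spark NLP also offers dedicated de-identification annotators):

```python
# Illustrative de-identification: substitute each detected PHI chunk with
# its entity label. Longer chunks are replaced first so a short chunk that
# is a substring of a longer one cannot clobber it.
def mask_phi(text, chunks):
    for chunk, label in sorted(chunks, key=lambda c: -len(c[0])):
        text = text.replace(chunk, f"<{label}>")
    return text

chunks = [("BUREAN MARIA", "NAME"), ("77", "AGE"), ("2450502264401", "ID")]
print(mask_phi("Nume si Prenume : BUREAN MARIA, Varsta: 77", chunks))
# Nume si Prenume : <NAME>, Varsta: <AGE>
```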
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_generic_bert|
|Compatibility:|Healthcare NLP 3.5.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ro|
|Size:|16.5 MB|
## References
- Custom John Snow Labs datasets
- Data augmentation techniques
## Benchmarking
```bash
label precision recall f1-score support
AGE 0.95 0.97 0.96 1186
CONTACT 0.99 0.98 0.98 366
DATE 0.96 0.92 0.94 4518
ID 1.00 1.00 1.00 679
LOCATION 0.91 0.90 0.90 1683
NAME 0.93 0.96 0.94 2916
PROFESSION 0.87 0.85 0.86 161
micro-avg 0.94 0.94 0.94 11509
macro-avg 0.94 0.94 0.94 11509
weighted-avg 0.95 0.94 0.94 11509
```
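As a sanity check on the table above, the weighted-average F1 is the support-weighted mean of the per-label F1 scores:

```python
# Per-label (f1, support) pairs copied from the benchmark table above.
labels = {
    "AGE": (0.96, 1186), "CONTACT": (0.98, 366), "DATE": (0.94, 4518),
    "ID": (1.00, 679), "LOCATION": (0.90, 1683), "NAME": (0.94, 2916),
    "PROFESSION": (0.86, 161),
}
total = sum(n for _, n in labels.values())
# Weight each label's F1 by its support, then normalize.
weighted_f1 = sum(f1 * n for f1, n in labels.values()) / total
print(total, round(weighted_f1, 2))  # 11509 0.94
```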
---
layout: model
title: Part of Speech for Chinese
author: John Snow Labs
name: pos_ud_gsd
date: 2020-05-04 20:11:00 +0800
task: Part of Speech Tagging
language: zh
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [pos, zh]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_zh_2.5.0_2.4_1588611712161.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_zh_2.5.0_2.4_1588611712161.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
pos = PerceptronModel.pretrained("pos_ud_gsd", "zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("除了担任北方国王之外,约翰·斯诺(John Snow)是一位英国医师,也是麻醉和医疗卫生发展的领导者。")
```
```scala
...
val pos = PerceptronModel.pretrained("pos_ud_gsd", "zh")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("除了担任北方国王之外,约翰·斯诺(John Snow)是一位英国医师,也是麻醉和医疗卫生发展的领导者。").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""除了担任北方国王之外,约翰·斯诺(John Snow)是一位英国医师,也是麻醉和医疗卫生发展的领导者。"""]
pos_df = nlu.load('zh.pos.ud_gsd').predict(text, output_level='token')
pos_df
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='pos', begin=0, end=20, result='NOUN', metadata={'word': '除了担任北方国王之外,约翰·斯诺(John'}),
Row(annotatorType='pos', begin=22, end=50, result='X', metadata={'word': 'Snow)是一位英国医师,也是麻醉和医疗卫生发展的领导者。'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_gsd|
|Type:|pos|
|Compatibility:|Spark NLP 2.5.0+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[pos]|
|Language:|zh|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: English DistilBertForQuestionAnswering model (from Slavka)
author: John Snow Labs
name: distilbert_qa_distil_bert_finetuned_log_parser_1
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distil-bert-finetuned-log-parser-1` is an English model originally trained by `Slavka`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_distil_bert_finetuned_log_parser_1_en_4.0.0_3.0_1654723459354.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_distil_bert_finetuned_log_parser_1_en_4.0.0_3.0_1654723459354.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distil_bert_finetuned_log_parser_1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distil_bert_finetuned_log_parser_1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.log_parser.by_Slavka").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
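nlu question-answering loaders take the question and the context in a single string, separated by `|||`. A tiny helper (illustrative only, not part of the nlu API) makes that convention explicit:

```python
def to_nlu_qa_input(question: str, context: str) -> str:
    # nlu QA models expect "question|||context" as one input string
    return f"{question}|||{context}"

qa_input = to_nlu_qa_input("What is my name?",
                           "My name is Clara and I live in Berkeley.")
```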
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_distil_bert_finetuned_log_parser_1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Slavka/distil-bert-finetuned-log-parser-1
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from Rishav-hub)
author: John Snow Labs
name: xlmroberta_ner_rishav_hub_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `Rishav-hub`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_rishav_hub_base_finetuned_panx_de_4.1.0_3.0_1660430288579.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_rishav_hub_base_finetuned_panx_de_4.1.0_3.0_1660430288579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_rishav_hub_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_rishav_hub_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_rishav_hub_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Rishav-hub/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: Fast Neural Machine Translation Model from Central Bikol to English
author: John Snow Labs
name: opus_mt_bcl_en
date: 2021-06-01
tags: [open_source, seq2seq, translation, bcl, en, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
source languages: bcl
target languages: en
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_en_xx_3.1.0_2.4_1622554147726.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_en_xx_3.1.0_2.4_1622554147726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_bcl_en", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_bcl_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Central Bikol.translate_to.English').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_bcl_en|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Fast Neural Machine Translation Model from Ilocano to English
author: John Snow Labs
name: opus_mt_ilo_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, ilo, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `ilo`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ilo_en_xx_2.7.0_2.4_1609165094786.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ilo_en_xx_2.7.0_2.4_1609165094786.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_ilo_en", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_ilo_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.ilo.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_ilo_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Sentence Entity Resolver for RxNorm (sbiobert_jsl_cased embeddings)
author: John Snow Labs
name: sbiobertresolve_rxnorm_augmented_cased
date: 2021-12-28
tags: [en, clinical, entity_resolution, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.4
spark_version: 2.4
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes using `sbiobert_jsl_cased` Sentence Bert Embeddings. It is trained on an augmented version of the dataset used in previous RxNorm resolver models. Additionally, this model returns the concept classes of the drugs in the `all_k_aux_labels` column.
## Predicted Entities
`RxNorm Codes`, `Concept Classes`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_augmented_cased_en_3.3.4_2.4_1640687886477.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_augmented_cased_en_3.3.4_2.4_1640687886477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_jsl_cased', 'en','clinical/models')\
.setInputCols(["ner_chunk"])\
.setOutputCol("sbert_embeddings")
rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented_cased", "en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("rxnorm_code")\
.setDistanceFunction("EUCLIDEAN")
rxnorm_pipelineModel = PipelineModel(
stages = [
documentAssembler,
sbert_embedder,
rxnorm_resolver])
light_model = LightPipeline(rxnorm_pipelineModel)
result = light_model.fullAnnotate(["Coumadin 5 mg", "aspirin", "Neurontin 300"])
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_jsl_cased", "en", "clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("sbert_embeddings")
val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented_cased", "en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("rxnorm_code")
.setDistanceFunction("EUCLIDEAN")
val rxnorm_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, rxnorm_resolver))
val rxnorm_pipelineModel = rxnorm_pipeline.fit(Seq("").toDS.toDF("text"))
val light_model = new LightPipeline(rxnorm_pipelineModel)
val result = light_model.fullAnnotate(Array("Coumadin 5 mg", "aspirin", "Neurontin 300"))
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.rxnorm.augmented_cased").predict("""Coumadin 5 mg""")
```
## Results
```bash
| | RxNormCode | Resolution | all_k_results | all_k_distances | all_k_cosine_distances | all_k_resolutions | all_k_aux_labels |
|---:|-------------:|:-------------------------------------------|:----------------------------------|:----------------------------------|:----------------------------------|:----------------------------------------------------------------|:----------------------------------|
| 0 | 855333 | warfarin sodium 5 MG [Coumadin] | 855333:::645146:::432467:::438... | 7.1909:::8.2961:::8.3727:::8.3... | 0.0887:::0.1170:::0.1176:::0.1... | warfarin sodium 5 MG [Coumadin]:::minoxidil 50 MG/ML Topical... | Branded Drug Comp:::Clinical D... |
| 1 | 1537020 | aspirin Effervescent Oral Tablet | 1537020:::1191:::437779:::7244... | 0.0000:::0.0000:::8.2570:::8.8... | 0.0000:::0.0000:::0.1147:::0.1... | aspirin Effervescent Oral Tablet:::aspirin:::aspirin / sulfu... | Clinical Drug Form:::Ingredien... |
| 2 | 105029 | gabapentin 300 MG Oral Capsule [Neurontin] | 105029:::2180332:::105852:::19... | 8.7466:::10.7744:::11.1256:::1... | 0.1212:::0.1843:::0.1981:::0.2... | gabapentin 300 MG Oral Capsule [Neurontin]:::darolutamide 30... | Branded Drug:::Branded Drug Co... |
```
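Each `all_k_*` column above packs the top-k candidates into a single `:::`-delimited string. A small helper (hypothetical, not part of Spark NLP) can split them back into aligned per-candidate lists, shown here on values taken from the aspirin row:

```python
def split_topk(row: dict, sep: str = ":::") -> dict:
    # split each ":::"-packed resolver column into a list of candidates
    return {col: val.split(sep) for col, val in row.items()}

row = {
    "all_k_results": "1537020:::1191",
    "all_k_resolutions": "aspirin Effervescent Oral Tablet:::aspirin",
}
parsed = split_topk(row)
```

The i-th element of every list then refers to the same candidate, so the lists can be zipped to iterate over (code, resolution) pairs.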
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_rxnorm_augmented_cased|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[rxnorm_code]|
|Language:|en|
|Size:|972.4 MB|
|Case sensitive:|false|
---
layout: model
title: Spanish RobertaForQuestionAnswering Base Cased model (from mrm8488)
author: John Snow Labs
name: roberta_qa_ruperta_base_finetuned_squadv2
date: 2022-12-02
tags: [es, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: es
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RuPERTa-base-finetuned-squadv2` is a Spanish model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ruperta_base_finetuned_squadv2_es_4.2.4_3.0_1669984933151.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ruperta_base_finetuned_squadv2_es_4.2.4_3.0_1669984933151.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ruperta_base_finetuned_squadv2","es")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ruperta_base_finetuned_squadv2","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_ruperta_base_finetuned_squadv2|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|470.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mrm8488/RuPERTa-base-finetuned-squadv2
---
layout: model
title: RCT Classifier (BioBERT)
author: John Snow Labs
name: bert_sequence_classifier_rct_biobert
date: 2022-03-01
tags: [licensed, en, rct, bert, sequence_classification]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: MedicalBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier that can classify the sections within the abstracts of scientific articles regarding randomized clinical trials (RCT).
## Predicted Entities
`BACKGROUND`, `CONCLUSIONS`, `METHODS`, `OBJECTIVE`, `RESULTS`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_rct_biobert_en_3.4.1_3.0_1646127001699.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_rct_biobert_en_3.4.1_3.0_1646127001699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
sequenceClassifier_loaded = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_rct_biobert", "en", "clinical/models")\
.setInputCols(["document", "token"])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier_loaded
])
data = spark.createDataFrame([["""Previous attempts to prevent all the unwanted postoperative responses to major surgery with an epidural hydrophilic opioid , morphine , have not succeeded . The authors ' hypothesis was that the lipophilic opioid fentanyl , infused epidurally close to the spinal-cord opioid receptors corresponding to the dermatome of the surgical incision , gives equal pain relief but attenuates postoperative hormonal and metabolic responses more effectively than does systemic fentanyl ."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_rct_biobert", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier))
val data = Seq("Previous attempts to prevent all the unwanted postoperative responses to major surgery with an epidural hydrophilic opioid , morphine , have not succeeded . The authors ' hypothesis was that the lipophilic opioid fentanyl , infused epidurally close to the spinal-cord opioid receptors corresponding to the dermatome of the surgical incision , gives equal pain relief but attenuates postoperative hormonal and metabolic responses more effectively than does systemic fentanyl .").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.clinical_trials").predict("""Previous attempts to prevent all the unwanted postoperative responses to major surgery with an epidural hydrophilic opioid , morphine , have not succeeded . The authors ' hypothesis was that the lipophilic opioid fentanyl , infused epidurally close to the spinal-cord opioid receptors corresponding to the dermatome of the surgical incision , gives equal pain relief but attenuates postoperative hormonal and metabolic responses more effectively than does systemic fentanyl .""")
```
## Results
```bash
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|text |class |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|[Previous attempts to prevent all the unwanted postoperative responses to major surgery with an epidural hydrophilic opioid , morphine , have not succeeded . The authors ' hypothesis was that the lipophilic opioid fentanyl , infused epidurally close to the spinal-cord opioid receptors corresponding to the dermatome of the surgical incision , gives equal pain relief but attenuates postoperative hormonal and metabolic responses more effectively than does systemic fentanyl .]|[BACKGROUND]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_rct_biobert|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.0 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
https://arxiv.org/abs/1710.06071
## Benchmarking
```bash
label precision recall f1-score support
BACKGROUND 0.77 0.86 0.81 2000
CONCLUSIONS 0.96 0.95 0.95 2000
METHODS 0.96 0.98 0.97 2000
OBJECTIVE 0.85 0.77 0.81 2000
RESULTS 0.98 0.95 0.96 2000
accuracy 0.9 0.9 0.9 10000
macro-avg 0.9 0.9 0.9 10000
weighted-avg 0.9 0.9 0.9 10000
```
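As a quick sanity check, the macro-averaged F1 reported above can be recomputed from the per-class scores (an unweighted mean is appropriate here, since every class has the same support of 2000):

```python
f1_scores = {
    "BACKGROUND": 0.81,
    "CONCLUSIONS": 0.95,
    "METHODS": 0.97,
    "OBJECTIVE": 0.81,
    "RESULTS": 0.96,
}
# macro average: unweighted mean of the per-class F1 scores
macro_f1 = round(sum(f1_scores.values()) / len(f1_scores), 2)
```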
---
layout: model
title: Relation extraction between body parts and direction entities
author: John Snow Labs
name: re_bodypart_directions
date: 2021-01-18
task: Relation Extraction
language: en
nav_key: models
edition: Spark NLP for Healthcare 2.7.1
spark_version: 2.4
tags: [en, relation_extraction, clinical, licensed]
supported: true
annotator: RelationExtractionModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Relation extraction between body part entities (`Internal_organ_or_component`, `External_body_part_or_region`) and the `Direction` entity in clinical texts. `1`: there is a relation between the body part entity and the direction entity; `0`: there is no relation between them.
## Predicted Entities
`0`, `1`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb#scrollTo=D8TtVuN-Ee8s){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_bodypart_directions_en_2.7.1_2.4_1610983817042.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_bodypart_directions_en_2.7.1_2.4_1610983817042.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
In the table below, `re_bodypart_directions` RE model, its labels, optimal NER model, and meaningful relation pairs are illustrated.
| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:----------------------:|:---------------:|:---------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| re_bodypart_directions | 0,1 | ner_jsl | [“direction-external_body_part_or_region”, “external_body_part_or_region-direction”, “direction-internal_organ_or_component”, “internal_organ_or_component-direction”] |
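The pair strings passed to `setRelationPairs` below are simply the lowercased NER labels joined with a hyphen, listed in both directions. A minimal sketch of building such a pair list (illustrative helper, not a Spark NLP API):

```python
def relation_pairs(label_a: str, label_b: str) -> list:
    # build "entity1-entity2" pair strings (lowercased), in both directions
    return [f"{label_a.lower()}-{label_b.lower()}",
            f"{label_b.lower()}-{label_a.lower()}"]

pairs = relation_pairs("Direction", "Internal_organ_or_component")
```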
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
words_embedder = WordEmbeddingsModel()\
.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")
pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
ner_tagger = MedicalNerModel()\
.pretrained("jsl_ner_wip_greedy_clinical","en","clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_chunker = NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ner_tags"])\
.setOutputCol("ner_chunks")
dependency_parser = DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentences", "pos_tags", "tokens"])\
.setOutputCol("dependencies")
pair_list = ['direction-internal_organ_or_component', 'internal_organ_or_component-direction']
re_model = RelationExtractionModel().pretrained("re_bodypart_directions","en","clinical/models")\
.setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
.setOutputCol("relations")\
.setMaxSyntacticDistance(4)\
.setRelationPairs(pair_list)
pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model])
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = LightPipeline(model).fullAnnotate(''' MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia ''')
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val words_embedder = WordEmbeddingsModel()
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val pos_tagger = PerceptronModel()
.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val ner_tagger = MedicalNerModel()
.pretrained("jsl_ner_wip_greedy_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_chunker = new NerConverterInternal()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel()
.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
val pair_list = Array("direction-internal_organ_or_component", "internal_organ_or_component-direction")
val re_model = RelationExtractionModel().pretrained("re_bodypart_directions","en","clinical/models")
.setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies"))
.setOutputCol("relations")
.setMaxSyntacticDistance(4)
.setRelationPairs(pair_list)
val nlpPipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model))
val text = """ MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia """
val data = Seq(text).toDS.toDF("text")
val results = nlpPipeline.fit(data).transform(data)
```
## Results
```bash
| index | relations | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|-------|-----------|-----------------------------|---------------|-------------|------------|-----------------------------|-------------|-------------|---------------|------------|
| 0 | 1 | Direction | 35 | 39 | upper | Internal_organ_or_component | 41 | 50 | brain stem | 0.9999989 |
| 1 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 59 | 68 | cerebellum | 0.99992585 |
| 2 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.9999999 |
| 3 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 54 | 57 | left | 0.999811 |
| 4 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 75 | 79 | right | 0.9998203 |
| 5 | 1 | Direction | 54 | 57 | left | Internal_organ_or_component | 59 | 68 | cerebellum | 1.0 |
| 6 | 0 | Direction | 54 | 57 | left | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.97616416 |
| 7 | 0 | Internal_organ_or_component | 59 | 68 | cerebellum | Direction | 75 | 79 | right | 0.953046 |
| 8 | 1 | Direction | 75 | 79 | right | Internal_organ_or_component | 81 | 93 | basil ganglia | 1.0 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|re_bodypart_directions|
|Type:|re|
|Compatibility:|Spark NLP 2.7.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]|
|Output Labels:|[relations]|
|Language:|en|
|Dependencies:|embeddings_clinical|
## Data Source
Trained on data gathered and manually annotated by John Snow Labs
## Benchmarking
```bash
label recall precision f1
0 0.87 0.9 0.88
1 0.99 0.99 0.99
```
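As a sanity check, the reported F1 scores are consistent with the harmonic mean of the precision and recall columns above:

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Figures from the benchmarking table above.
assert round(f1(0.9, 0.87), 2) == 0.88   # label 0
assert round(f1(0.99, 0.99), 2) == 0.99  # label 1
```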
---
layout: model
title: English BertForQuestionAnswering Cased model (from enoriega)
author: John Snow Labs
name: bert_qa_rule_softmatching
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_softmatching` is an English model originally trained by `enoriega`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_rule_softmatching_en_4.0.0_3.0_1657191287470.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_rule_softmatching_en_4.0.0_3.0_1657191287470.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_softmatching","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_softmatching","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_rule_softmatching|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/enoriega/rule_softmatching
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from AlirezaBaneshi)
author: John Snow Labs
name: roberta_qa_autotrain_test2_756523213
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-test2-756523213` is an English model originally trained by `AlirezaBaneshi`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_autotrain_test2_756523213_en_4.3.0_3.0_1674209108948.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_autotrain_test2_756523213_en_4.3.0_3.0_1674209108948.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_autotrain_test2_756523213","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_autotrain_test2_756523213","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_autotrain_test2_756523213|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|415.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AlirezaBaneshi/autotrain-test2-756523213
---
layout: model
title: Swahili XLMRobertaForTokenClassification Base Cased model (from mbeukman)
author: John Snow Labs
name: xlmroberta_ner_base_finetuned_amharic_finetuned_ner_swahili
date: 2022-08-01
tags: [sw, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: sw
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-amharic-finetuned-ner-swahili` is a Swahili model originally trained by `mbeukman`.
## Predicted Entities
`PER`, `LOC`, `ORG`, `DATE`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_amharic_finetuned_ner_swahili_sw_4.1.0_3.0_1659353848315.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_amharic_finetuned_ner_swahili_sw_4.1.0_3.0_1659353848315.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_amharic_finetuned_ner_swahili","sw") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_amharic_finetuned_ner_swahili","sw")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_finetuned_amharic_finetuned_ner_swahili|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|sw|
|Size:|1.0 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-amharic-finetuned-ner-swahili
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://github.com/masakhane-io/masakhane-ner
---
layout: model
title: Mapping Companies to NASDAQ Stock Screener by Ticker
author: John Snow Labs
name: finmapper_nasdaq_ticker_stock_screener
date: 2023-01-19
tags: [en, finance, licensed, nasdaq, ticker]
task: Chunk Mapping
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Given a company ticker, this model retrieves the following information from the NASDAQ Stock Screener:
- Country
- IPO_Year
- Industry
- Last_Sale
- Market_Cap
- Name
- Net_Change
- Percent_Change
- Sector
- Ticker
- Volume
First extract the TICKER symbol from the financial text with the `finner_ticker` model, then retrieve the detailed company information with this ChunkMapper model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmapper_nasdaq_ticker_stock_screener_en_1.0.0_3.0_1674157233652.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmapper_nasdaq_ticker_stock_screener_en_1.0.0_3.0_1674157233652.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
tokenizer = nlp.Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_ticker", "en", "finance/models")\
.setInputCols(["document", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
CM = finance.ChunkMapperModel.pretrained('finmapper_nasdaq_ticker_stock_screener', 'en', 'finance/models')\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")
pipeline = nlp.Pipeline().setStages([document_assembler,
tokenizer,
embeddings,
ner_model,
ner_converter,
CM])
text = ["""There are some serious purchases and sales of AMZN stock today."""]
test_data = spark.createDataFrame([text]).toDF("text")
model = pipeline.fit(test_data)
result = model.transform(test_data).select('mappings').collect()
```
## Results
```bash
{
  "Country": "United States",
  "IPO_Year": "1997",
  "Industry": "Catalog/Specialty Distribution",
  "Last_Sale": "$98.12",
  "Market_Cap": "9.98556270184E11",
  "Name": "Amazon.com Inc. Common Stock",
  "Net_Change": "2.85",
  "Percent_Change": "2.991%",
  "Sector": "Consumer Discretionary",
  "Ticker": "AMZN",
  "Volume": "85412563"
}
```
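The mapped values come back as strings, so numeric fields such as `Last_Sale`, `Market_Cap`, and `Percent_Change` need light parsing before analysis. A hypothetical post-processing sketch (`parse_screener` is illustrative, not part of the library; field names and formats are taken from the result above):

```python
def parse_screener(mapping):
    # Convert the string-valued screener fields into numeric types.
    out = dict(mapping)
    out["Last_Sale"] = float(mapping["Last_Sale"].lstrip("$"))
    out["Market_Cap"] = float(mapping["Market_Cap"])
    out["Percent_Change"] = float(mapping["Percent_Change"].rstrip("%"))
    out["Volume"] = int(mapping["Volume"])
    return out

amzn = parse_screener({
    "Last_Sale": "$98.12",
    "Market_Cap": "9.98556270184E11",
    "Percent_Change": "2.991%",
    "Volume": "85412563",
})
print(amzn["Last_Sale"], amzn["Percent_Change"])  # 98.12 2.991
```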
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finmapper_nasdaq_ticker_stock_screener|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|584.5 KB|
## References
https://www.nasdaq.com/market-activity/stocks/screener
---
layout: model
title: Korean Bert Embeddings (from kykim)
author: John Snow Labs
name: bert_embeddings_bert_kor_base
date: 2022-04-11
tags: [bert, embeddings, ko, open_source]
task: Embeddings
language: ko
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-kor-base` is a Korean model originally trained by `kykim`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_kor_base_ko_3.4.2_3.0_1649675505476.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_kor_base_ko_3.4.2_3.0_1649675505476.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_kor_base","ko") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_kor_base","ko")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ko.embed.bert_kor_base").predict("""나는 Spark NLP를 좋아합니다""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_kor_base|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ko|
|Size:|444.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/kykim/bert-kor-base
- https://github.com/kiyoungkim1/LM-kor
---
layout: model
title: English RobertaForQuestionAnswering (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739648723.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739648723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.base_rule_based_only_classfn_twostage_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|463.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0
---
layout: model
title: Spanish BERT Sentence Base Uncased Embedding
author: John Snow Labs
name: sent_bert_base_uncased
date: 2021-09-06
tags: [spanish, open_source, bert_sentence_embeddings, uncased, es]
task: Embeddings
language: es
edition: Spark NLP 3.2.2
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
BETO is a BERT model trained on a large Spanish corpus. It is similar in size to BERT-Base and was trained with the Whole Word Masking technique. The original release provides TensorFlow and PyTorch checkpoints for the uncased and cased versions, along with results on Spanish benchmarks comparing BETO with Multilingual BERT and other (non-BERT-based) models.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_base_uncased_es_3.2.2_3.0_1630926281024.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_base_uncased_es_3.2.2_3.0_1630926281024.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_uncased", "es") \
.setInputCols("sentence") \
.setOutputCol("bert_sentence")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings ])
```
```scala
val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_uncased", "es")
.setInputCols("sentence")
.setOutputCol("bert_sentence")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings ))
```
{:.nlu-block}
```python
import nlu
nlu.load("es.embed_sentence.bert.base_uncased").predict("""Put your text here.""")
```
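A common downstream use of sentence embeddings is comparing two sentences by cosine similarity. A minimal sketch in plain Python; the short vectors here are toy stand-ins for the model's actual `bert_sentence` output:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [0.2, 0.1, 0.7]
v2 = [0.2, 0.1, 0.7]
v3 = [-0.7, 0.1, 0.2]
print(round(cosine(v1, v2), 3))        # identical vectors score 1.0
print(cosine(v1, v3) < cosine(v1, v2))  # dissimilar vectors score lower
```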
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_bert_base_uncased|
|Compatibility:|Spark NLP 3.2.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[bert_sentence]|
|Language:|es|
|Case sensitive:|true|
## Data Source
The model is imported from: https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased
---
layout: model
title: Thai Part of Speech Tagger (from KoichiYasuoka)
author: John Snow Labs
name: bert_pos_bert_base_thai_upos
date: 2022-05-09
tags: [bert, pos, part_of_speech, th, open_source]
task: Part of Speech Tagging
language: th
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Part of Speech tagging model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-thai-upos` is a Thai model originally trained by `KoichiYasuoka`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_thai_upos_th_3.4.2_3.0_1652092549919.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_thai_upos_th_3.4.2_3.0_1652092549919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_thai_upos","th") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["ฉันรัก Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_thai_upos","th")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("ฉันรัก Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_pos_bert_base_thai_upos|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|th|
|Size:|345.8 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/KoichiYasuoka/bert-base-thai-upos
- https://universaldependencies.org/u/pos/
- https://github.com/KoichiYasuoka/esupar
---
layout: model
title: Detect Genetic Cancer Entities
author: John Snow Labs
name: ner_cancer_genetics
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The Named Entity Recognition annotator allows a generic model to be trained using a deep learning algorithm (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This is a pretrained named entity recognition deep learning model for biology and genetics terms.
## Predicted Entities
`DNA`, `RNA`, `cell_line`, `cell_type`, `protein`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_TUMOR/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cancer_genetics_en_3.0.0_3.0_1617209717722.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cancer_genetics_en_3.0.0_3.0_1617209717722.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_cancer_genetics", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([['The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.']], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_cancer_genetics", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))
val data = Seq("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.cancer").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.""")
```
## Results
```bash
+-------------------+---------+
| token|ner_label|
+-------------------+---------+
| The| O|
| human|B-protein|
| KCNJ9|I-protein|
| (| O|
| Kir|B-protein|
| 3.3|I-protein|
| ,| O|
| GIRK3|B-protein|
| )| O|
| is| O|
| a| O|
| member| O|
| of| O|
| the| O|
|G-protein-activated|B-protein|
| inwardly|I-protein|
| rectifying|I-protein|
| potassium|I-protein|
| (|I-protein|
| GIRK|I-protein|
| )|I-protein|
| channel|I-protein|
| family|I-protein|
| .| O|
| Here| O|
| we| O|
| describe| O|
| the| O|
|genomicorganization| O|
| of| O|
| the| O|
| KCNJ9| B-DNA|
| locus| I-DNA|
| on| O|
| chromosome| B-DNA|
| 1q21-23| I-DNA|
+-------------------+---------+
```
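The `ner_converter` stage in the pipeline above turns these token-level BIO tags into entity chunks. A minimal sketch of that aggregation logic, for illustration only (the real NerConverter also tracks character offsets and confidence):

```python
def bio_to_chunks(tokens, tags):
    # Group B-/I- tagged tokens into (text, label) chunks.
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Tokens and tags from the start of the results table above.
tokens = ["The", "human", "KCNJ9", "(", "Kir", "3.3", ",", "GIRK3", ")"]
tags   = ["O", "B-protein", "I-protein", "O", "B-protein", "I-protein", "O", "B-protein", "O"]
print(bio_to_chunks(tokens, tags))
# → [('human KCNJ9', 'protein'), ('Kir 3.3', 'protein'), ('GIRK3', 'protein')]
```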
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_cancer_genetics|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Data Source
Trained on the Cancer Genetics (CG) task of the BioNLP Shared Task 2013 with `embeddings_clinical`.
https://aclanthology.org/W13-2008/
## Benchmarking
```bash
label tp fp fn prec rec f1
B-cell_line 581 148 151 0.79698217 0.79371583 0.79534566
I-DNA 2751 735 317 0.7891566 0.89667535 0.8394873
I-protein 4416 867 565 0.8358887 0.88656896 0.8604832
B-protein 5288 763 660 0.8739051 0.8890383 0.8814068
I-cell_line 1148 244 301 0.82471263 0.79227054 0.80816615
I-RNA 221 60 27 0.78647685 0.891129 0.83553874
B-RNA 157 40 36 0.79695433 0.8134715 0.8051282
B-cell_type 1127 292 250 0.7942213 0.8184459 0.8061516
I-cell_type 1547 392 263 0.7978339 0.85469615 0.82528675
B-DNA 1513 444 387 0.77312213 0.7963158 0.7845475
Macro-average prec: 0.8069253, rec: 0.84323275, f1: 0.82467955
Micro-average prec: 0.82471186, rec: 0.86377037, f1: 0.84378934
```
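The micro-averaged figures follow directly from the tp/fp/fn columns: precision is Σtp / (Σtp + Σfp) and recall is Σtp / (Σtp + Σfn). A quick check against the table above:

```python
# tp/fp/fn columns from the benchmarking table, in row order.
tp = [581, 2751, 4416, 5288, 1148, 221, 157, 1127, 1547, 1513]
fp = [148, 735, 867, 763, 244, 60, 40, 292, 392, 444]
fn = [151, 317, 565, 660, 301, 27, 36, 250, 263, 387]

prec = sum(tp) / (sum(tp) + sum(fp))
rec  = sum(tp) / (sum(tp) + sum(fn))
f1   = 2 * prec * rec / (prec + rec)

print(round(prec, 5), round(rec, 5), round(f1, 5))
# → 0.82471 0.86377 0.84379
```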
---
layout: model
title: Named Entity Recognition Profiling (Biobert)
author: John Snow Labs
name: ner_profiling_biobert
date: 2021-11-03
tags: [ner, ner_profiling, clinical, licensed, en, biobert]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 3.3.1
spark_version: 2.4
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline can be used to explore all the available pretrained NER models at once. When you run this pipeline over your text, you will get the predictions of each pretrained clinical NER model trained with `biobert_pubmed_base_cased`. This version extends the previous one with additional NER model outputs.
Here are the NER models that this pretrained pipeline includes: `ner_jsl_enriched_biobert`, `ner_clinical_biobert`, `ner_chemprot_biobert`, `ner_jsl_greedy_biobert`, `ner_bionlp_biobert`, `ner_human_phenotype_go_biobert`, `jsl_rd_ner_wip_greedy_biobert`, `ner_posology_large_biobert`, `ner_risk_factors_biobert`, `ner_anatomy_coarse_biobert`, `ner_deid_enriched_biobert`, `ner_human_phenotype_gene_biobert`, `ner_jsl_biobert`, `ner_events_biobert`, `ner_deid_biobert`, `ner_posology_biobert`, `ner_diseases_biobert`, `jsl_ner_wip_greedy_biobert`, `ner_ade_biobert`, `ner_anatomy_biobert`, `ner_cellular_biobert` .
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.2.Pretrained_NER_Profiling_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_3.3.1_2.4_1635977081207.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_3.3.1_2.4_1635977081207.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
ner_profiling_pipeline = PretrainedPipeline('ner_profiling_biobert', 'en', 'clinical/models')
result = ner_profiling_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_biobert", "en", "clinical/models")
val result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.profiling_biobert").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""")
```
## Results
```bash
******************** ner_diseases_biobert Model Results ********************
[('gestational diabetes mellitus', 'Disease'), ('type two diabetes mellitus', 'Disease'), ('T2DM', 'Disease'), ('HTG-induced pancreatitis', 'Disease'), ('hepatitis', 'Disease'), ('obesity', 'Disease'), ('polyuria', 'Disease'), ('polydipsia', 'Disease'), ('poor appetite', 'Disease'), ('vomiting', 'Disease')]
******************** ner_events_biobert Model Results ********************
[('gestational diabetes mellitus', 'PROBLEM'), ('eight years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('three years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index', 'TEST'), ('BMI', 'TEST'), ('presented', 'OCCURRENCE'), ('a one-week', 'DURATION'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')]
******************** ner_jsl_biobert Model Results ********************
[('28-year-old', 'Age'), ('female', 'Gender'), ('gestational diabetes mellitus', 'Diabetes'), ('eight years prior', 'RelativeDate'), ('type two diabetes mellitus', 'Diabetes'), ('T2DM', 'Disease_Syndrome_Disorder'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('three years prior', 'RelativeDate'), ('acute', 'Modifier'), ('hepatitis', 'Disease_Syndrome_Disorder'), ('obesity', 'Obesity'), ('body mass index', 'BMI'), ('BMI ) of 33.5 kg/m2', 'BMI'), ('one-week', 'Duration'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')]
******************** ner_clinical_biobert Model Results ********************
[('gestational diabetes mellitus', 'PROBLEM'), ('subsequent type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index ( BMI )', 'TEST'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')]
******************** ner_risk_factors_biobert Model Results ********************
[('diabetes mellitus', 'DIABETES'), ('subsequent type two diabetes mellitus', 'DIABETES'), ('obesity', 'OBESE')]
...
```
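The listings above are just `(chunk, label)` pairs grouped by model. A minimal, self-contained sketch of how you might tabulate them from an `annotate()`-style dictionary (the key layout and the `mock_result` values below are illustrative assumptions, not the pipeline's exact output schema):

```python
# Hypothetical annotate()-style output: one chunk list per NER model.
# Key names and values here are assumptions for illustration only.
mock_result = {
    "ner_diseases_biobert_chunks": [("obesity", "Disease"), ("vomiting", "Disease")],
    "ner_risk_factors_biobert_chunks": [("obesity", "OBESE")],
}

def summarize(result: dict) -> dict:
    """Group (chunk, label) pairs under a readable model name."""
    summary = {}
    for key, pairs in result.items():
        model = key[:-len("_chunks")] if key.endswith("_chunks") else key
        summary[model] = list(pairs)
    return summary

for model, pairs in summarize(mock_result).items():
    print(f"**** {model} Model Results ****")
    print(pairs)
```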
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_profiling_biobert|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.3.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel (x21)
- NerConverter (x21)
- Finisher
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_64_finetuned_squad_seed_8
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-64-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_64_finetuned_squad_seed_8_en_4.3.0_3.0_1674216234526.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_64_finetuned_squad_seed_8_en_4.3.0_3.0_1674216234526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_64_finetuned_squad_seed_8","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_64_finetuned_squad_seed_8","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_64_finetuned_squad_seed_8|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|419.6 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-64-finetuned-squad-seed-8
---
layout: model
title: Legal Condemnation Clause Binary Classifier
author: John Snow Labs
name: legclf_condemnation_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `condemnation` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do binary classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
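As a rough illustration of the paragraph-splitting advice above, here is a minimal, library-free sketch that splits on blank lines and then caps each piece at a token budget (512 here; whitespace tokens stand in for the model's actual subword tokenizer, so treat the counts as approximate):

```python
def split_document(text: str, max_tokens: int = 512) -> list:
    """Split on blank lines (paragraphs), then re-chunk any paragraph
    that exceeds the token budget (whitespace tokens as a rough proxy)."""
    pieces = []
    for paragraph in text.split("\n\n"):
        tokens = paragraph.split()
        if not tokens:
            continue
        for start in range(0, len(tokens), max_tokens):
            pieces.append(" ".join(tokens[start:start + max_tokens]))
    return pieces

doc = "First clause text here.\n\nSecond clause text here."
print(split_document(doc))
```

Each resulting piece can then be fed to the classifier independently, and the per-piece predictions aggregated however suits your use case.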
## Predicted Entities
`other`, `condemnation`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_condemnation_clause_en_1.0.0_3.2_1660122247655.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_condemnation_clause_en_1.0.0_3.2_1660122247655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
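{% include programmingLanguageSelectScalaPythonNLU.html %}

This card does not include a usage snippet. Below is a minimal sketch in the style of the sibling `legclf_*` cards, assuming the usual document-classification stack (DocumentAssembler → sentence embeddings → ClassifierDL). The embeddings model name is an assumption for illustration; match it to the embeddings this classifier was trained with (its input labels are `[sentence_embeddings]`).

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# NOTE: "sent_bert_base_cased" is an illustrative assumption, not
# necessarily the embeddings this classifier was trained with.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = ClassifierDLModel.pretrained("legclf_condemnation_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```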
## Results
```bash
+--------------+
|        result|
+--------------+
|[condemnation]|
|       [other]|
|       [other]|
|[condemnation]|
+--------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_condemnation_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
condemnation 0.96 1.00 0.98 23
other 1.00 0.99 0.99 85
accuracy - - 0.99 108
macro-avg 0.98 0.99 0.99 108
weighted-avg 0.99 0.99 0.99 108
```
---
layout: model
title: Fon asr_fonxlsr TFWav2Vec2ForCTC from chrisjay
author: John Snow Labs
name: asr_fonxlsr
date: 2022-09-24
tags: [wav2vec2, fon, audio, open_source, asr]
task: Automatic Speech Recognition
language: fon
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_fonxlsr` is a Fon model originally trained by chrisjay.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_fonxlsr_gpu` instead.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_fonxlsr_fon_4.2.0_3.0_1664024800003.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_fonxlsr_fon_4.2.0_3.0_1664024800003.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_fonxlsr", "fon")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_fonxlsr", "fon")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_fonxlsr|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|fon|
|Size:|1.2 GB|
---
layout: model
title: Translate English to Mon-Khmer languages Pipeline
author: John Snow Labs
name: translate_en_mkh
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, mkh, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `mkh`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_mkh_xx_2.7.0_2.4_1609689989219.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_mkh_xx_2.7.0_2.4_1609689989219.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_mkh", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_mkh", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.mkh').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_mkh|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Clinical Major Concepts to UMLS Code Pipeline
author: John Snow Labs
name: umls_major_concepts_resolver_pipeline
date: 2023-03-30
tags: [en, umls, licensed, pipeline, resolver, clinical]
task: Pipeline Healthcare
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline maps entities (Clinical Major Concepts) to their corresponding UMLS CUI codes. Simply feed in your text and it will return the matching UMLS codes.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_major_concepts_resolver_pipeline_en_4.3.2_3.2_1680192225130.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_major_concepts_resolver_pipeline_en_4.3.2_3.2_1680192225130.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline= PretrainedPipeline("umls_major_concepts_resolver_pipeline", "en", "clinical/models")
pipeline.annotate("The patient complains of pustules after falling from stairs. She has been advised Arthroscopy by her primary care physician")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline= PretrainedPipeline("umls_major_concepts_resolver_pipeline", "en", "clinical/models")
val result = pipeline.annotate("The patient complains of pustules after falling from stairs. She has been advised Arthroscopy by her primary care physician")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.umls_major_concepts_resolver").predict("""The patient complains of pustules after falling from stairs. She has been advised Arthroscopy by her primary care physician""")
```
## Results
```bash
+-----------+-----------------------------------+---------+
|chunk |ner_label |umls_code|
+-----------+-----------------------------------+---------+
|pustules |Sign_or_Symptom |C0241157 |
|stairs |Daily_or_Recreational_Activity |C4300351 |
|Arthroscopy|Therapeutic_or_Preventive_Procedure|C0179144 |
+-----------+-----------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|umls_major_concepts_resolver_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|3.0 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- ChunkMapperModel
- ChunkMapperModel
- ChunkMapperFilterer
- Chunk2Doc
- BertSentenceEmbeddings
- SentenceEntityResolverModel
- ResolverMerger
---
layout: model
title: English DistilBertForQuestionAnswering model (from machine2049)
author: John Snow Labs
name: distilbert_qa_base_uncased_finetuned_squad_
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad_distilbert` is an English model originally trained by `machine2049`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad__en_4.0.0_3.0_1654726959258.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad__en_4.0.0_3.0_1654726959258.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_machine2049").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_finetuned_squad_|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/machine2049/distilbert-base-uncased-finetuned-squad_distilbert
---
layout: model
title: Swedish asr_test_by_marma TFWav2Vec2ForCTC from marma
author: John Snow Labs
name: asr_test_by_marma
date: 2022-09-25
tags: [wav2vec2, sv, audio, open_source, asr]
task: Automatic Speech Recognition
language: sv
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_test_by_marma` is a Swedish model originally trained by marma.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_test_by_marma_gpu` instead.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_test_by_marma_sv_4.2.0_3.0_1664116267395.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_test_by_marma_sv_4.2.0_3.0_1664116267395.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_test_by_marma", "sv")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_test_by_marma", "sv")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_test_by_marma|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|sv|
|Size:|756.2 MB|
---
layout: model
title: Chinese Word Segmentation
author: John Snow Labs
name: wordseg_msra
date: 2021-01-03
task: Word Segmentation
language: zh
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, word_segmentation, zh, cn]
supported: true
annotator: WordSegmenterModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
WordSegmenterModel (WSM) is based on a maximum entropy probability model to detect word boundaries in Chinese text. Chinese text is written without white space between words, so a computer-based application cannot know _a priori_ which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.
References:
- Xue, Nianwen. "Chinese word segmentation as character tagging." International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing.
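Xue's character-tagging formulation can be made concrete with a tiny decoder: given one B (begin) or I (inside) tag per character, words are recovered by starting a new word at every B. A minimal sketch (the tags here are hand-written for illustration, not model output):

```python
def decode_tags(chars: str, tags: list) -> list:
    """Rebuild words from per-character B/I tags (B starts a new word)."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)                 # begin a new word
        else:
            words.append(words.pop() + ch)   # extend the current word
    return words

# "然而" and "一些" as two two-character words, hand-tagged:
print(decode_tags("然而一些", ["B", "I", "B", "I"]))
# → ['然而', '一些']
```

The pretrained model's job is precisely to predict those per-character tags; the tokenizer output shown in the Results section is the decoded word sequence.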
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_msra_zh_2.7.0_2.4_1609693916888.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_msra_zh_2.7.0_2.4_1609693916888.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline as a substitute for the Tokenizer stage.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
word_segmenter = WordSegmenterModel.pretrained('wordseg_msra', 'zh')\
.setInputCols("document")\
.setOutputCol("token")
pipeline = Pipeline(stages=[document_assembler, word_segmenter])
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
example = spark.createDataFrame([['然而,这样的处理也衍生了一些问题。']], ["text"])
result = model.transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_msra", "zh")
.setInputCols("document")
.setOutputCol("token")
val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter))
val data = Seq("然而,这样的处理也衍生了一些问题。").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""然而,这样的处理也衍生了一些问题。"""]
token_df = nlu.load('zh.segment_words.msra').predict(text, output_level='token')
token_df
```
## Results
```bash
+----------------------------------+--------------------------------------------------------+
|text |result |
+----------------------------------+--------------------------------------------------------+
|然而,这样的处理也衍生了一些问题。|[然而, ,, 这样, 的, 处理, 也, 衍生, 了, 一些, 问题, 。]|
+----------------------------------+--------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|wordseg_msra|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[token]|
|Language:|zh|
## Data Source
We trained this model on the Microsoft Research Asia (MSRA) data set available on the Second International Chinese Word Segmentation Bakeoff [SIGHAN 2005](http://sighan.cs.uchicago.edu/bakeoff2005)
## Benchmarking
```bash
| Model         | precision | recall | f1-score |
|---------------|-----------|--------|----------|
| WORDSEG_CTB   | 0.6453    | 0.6341 | 0.6397   |
| WORDSEG_WEIBO | 0.5454    | 0.5655 | 0.5553   |
| WORDSEG_MSRA  | 0.5984    | 0.6088 | 0.6035   |
| WORDSEG_PKU   | 0.6094    | 0.6321 | 0.6206   |
| WORDSEG_LARGE | 0.6326    | 0.6269 | 0.6297   |
```
---
layout: model
title: Multilingual XLMRobertaForTokenClassification Base Cased model (from V3RX2000)
author: John Snow Labs
name: xlmroberta_ner_v3rx2000_base_finetuned_panx_all
date: 2022-08-13
tags: [xx, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: xx
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `V3RX2000`.
## Predicted Entities
`ORG`, `LOC`, `PER`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_v3rx2000_base_finetuned_panx_all_xx_4.1.0_3.0_1660427771019.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_v3rx2000_base_finetuned_panx_all_xx_4.1.0_3.0_1660427771019.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_v3rx2000_base_finetuned_panx_all","xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_v3rx2000_base_finetuned_panx_all","xx")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_v3rx2000_base_finetuned_panx_all|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|xx|
|Size:|861.7 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/V3RX2000/xlm-roberta-base-finetuned-panx-all
---
layout: model
title: English Part of Speech Tagger (Large, UPOS-Universal Part-Of-Speech)
author: John Snow Labs
name: roberta_pos_roberta_large_english_upos
date: 2022-05-03
tags: [roberta, pos, part_of_speech, en, open_source]
task: Part of Speech Tagging
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-large-english-upos` is an English model originally trained by `KoichiYasuoka`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_pos_roberta_large_english_upos_en_3.4.2_3.0_1651596140502.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_pos_roberta_large_english_upos_en_3.4.2_3.0_1651596140502.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_roberta_large_english_upos","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_roberta_large_english_upos","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.pos.roberta_large_english_upos").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_pos_roberta_large_english_upos|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/KoichiYasuoka/roberta-large-english-upos
- https://universaldependencies.org/en/
- https://universaldependencies.org/u/pos/
- https://github.com/KoichiYasuoka/esupar
---
layout: model
title: Stop Words Cleaner for Polish
author: John Snow Labs
name: stopwords_pl
date: 2020-07-14 19:03:00 +0800
task: Stop Words Removal
language: pl
edition: Spark NLP 2.5.4
spark_version: 2.4
tags: [stopwords, pl]
supported: true
annotator: StopWordsCleaner
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model removes stop words from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to work with only the most semantically important words in a text and ignore words that are rarely semantically relevant, such as articles and prepositions.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_pl_pl_2.5.4_2.4_1594742438519.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_pl_pl_2.5.4_2.4_1594742438519.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
stop_words = StopWordsCleaner.pretrained("stopwords_pl", "pl") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Oprócz bycia królem północy, John Snow jest angielskim lekarzem i liderem w rozwoju anestezjologii i higieny medycznej.")
```
```scala
...
val stopWords = StopWordsCleaner.pretrained("stopwords_pl", "pl")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords))
val data = Seq("Oprócz bycia królem północy, John Snow jest angielskim lekarzem i liderem w rozwoju anestezjologii i higieny medycznej.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Oprócz bycia królem północy, John Snow jest angielskim lekarzem i liderem w rozwoju anestezjologii i higieny medycznej."""]
stopword_df = nlu.load('pl.stopwords').predict(text)
stopword_df[['cleanTokens']]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=5, result='Oprócz', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=7, end=11, result='bycia', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=13, end=18, result='królem', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=20, end=26, result='północy', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=27, end=27, result=',', metadata={'sentence': '0'}),
...]
```
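Conceptually, the cleaner simply drops tokens found in a language-specific stopword list. A tiny Spark-free sketch (using an illustrative five-word subset; the pretrained model ships the full Polish list):

```python
# Illustrative subset of Polish stopwords, not the model's full list.
polish_stopwords = {"i", "w", "jest", "bycia", "oprócz"}

def clean_tokens(tokens, stopwords):
    # Keep only tokens that are not in the stopword set (case-insensitive).
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["Oprócz", "bycia", "królem", "północy", "John", "Snow", "jest", "angielskim", "lekarzem"]
print(clean_tokens(tokens, polish_stopwords))
# ['królem', 'północy', 'John', 'Snow', 'angielskim', 'lekarzem']
```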
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_pl|
|Type:|stopwords|
|Compatibility:|Spark NLP 2.5.4+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|pl|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Duplets)
author: John Snow Labs
name: distilbert_qa_duplets_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Duplets`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_duplets_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768419922.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_duplets_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768419922.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_duplets_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_duplets_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
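Under the hood, extractive QA models of this kind score each token as a candidate answer start and end, then select the best-scoring valid span from the context. A simplified, framework-free illustration (toy scores; this is not Spark NLP's actual decoding code):

```python
# Pick the (start, end) token span maximizing start_score + end_score,
# subject to end >= start and a maximum span length.
def best_span(start_scores, end_scores, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 3.5, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]
end   = [0.0, 0.1, 0.0, 3.0, 0.2, 0.0, 0.0, 0.1, 1.2, 0.0]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```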
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_duplets_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Duplets/distilbert-base-uncased-finetuned-squad
---
layout: model
title: French CamemBert Embeddings (from Henrywang)
author: John Snow Labs
name: camembert_embeddings_Henrywang_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `Henrywang`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Henrywang_generic_model_fr_3.4.4_3.0_1653986322898.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Henrywang_generic_model_fr_3.4.4_3.0_1653986322898.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Henrywang_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Henrywang_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_Henrywang_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Henrywang/dummy-model
---
layout: model
title: Modern Greek (1453-) asr_greek_lsr_1 TFWav2Vec2ForCTC from skylord
author: John Snow Labs
name: pipeline_asr_greek_lsr_1
date: 2022-09-25
tags: [wav2vec2, el, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: el
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_greek_lsr_1` is a Modern Greek (1453-) model originally trained by skylord.
NOTE: This pipeline only works on a CPU; if you need to run this pipeline on a GPU device, please use pipeline_asr_greek_lsr_1_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_greek_lsr_1_el_4.2.0_3.0_1664111543823.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_greek_lsr_1_el_4.2.0_3.0_1664111543823.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_greek_lsr_1', lang = 'el')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_greek_lsr_1", lang = "el")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_greek_lsr_1|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|el|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Ganda asr_wav2vec2_luganda_by_cahya TFWav2Vec2ForCTC from cahya
author: John Snow Labs
name: asr_wav2vec2_luganda_by_cahya
date: 2022-09-24
tags: [wav2vec2, lg, audio, open_source, asr]
task: Automatic Speech Recognition
language: lg
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_luganda_by_cahya` is a Ganda model originally trained by cahya.
NOTE: This model only works on a CPU; if you need to run this model on a GPU device, please use asr_wav2vec2_luganda_by_cahya_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_luganda_by_cahya_lg_4.2.0_3.0_1664037739573.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_luganda_by_cahya_lg_4.2.0_3.0_1664037739573.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_luganda_by_cahya", "lg")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_luganda_by_cahya", "lg")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
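Both snippets assume an existing `audioDf` whose `audio_content` column holds each clip as an array of floats in [-1.0, 1.0], typically 16 kHz mono for Wav2Vec2 models. A Spark-free sketch of preparing such floats (the synthesized sine wave stands in for samples read from a WAV file, e.g. via the stdlib `wave` module; the `spark` session would come from `sparknlp.start()`):

```python
import math

def pcm16_to_floats(samples):
    """Normalize signed 16-bit PCM samples to floats in [-1.0, 1.0]."""
    return [s / 32768.0 for s in samples]

# Stand-in for PCM samples read from a 16 kHz mono WAV file:
# 10 ms of a 440 Hz tone at full 16-bit amplitude.
pcm = [int(32767 * math.sin(2 * math.pi * 440 * t / 16000)) for t in range(160)]
floats = pcm16_to_floats(pcm)
assert all(-1.0 <= f <= 1.0 for f in floats)

# With a Spark session available, the DataFrame the pipeline consumes is:
# audioDf = spark.createDataFrame([(floats,)], ["audio_content"])
```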
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_luganda_by_cahya|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|lg|
|Size:|1.2 GB|
---
layout: model
title: Fast Neural Machine Translation Model from San Salvador Kongo to English
author: John Snow Labs
name: opus_mt_kwy_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, kwy, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `kwy`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_kwy_en_xx_2.7.0_2.4_1609170973770.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_kwy_en_xx_2.7.0_2.4_1609170973770.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_kwy_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_kwy_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.kwy.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_kwy_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: BioBERT Embeddings (PMC)
author: John Snow Labs
name: biobert_pmc_base_cased
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [embeddings, en, open_source]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed for biomedical text-mining tasks such as biomedical named entity recognition, relation extraction, and question answering. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)".
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_pmc_base_cased_en_2.6.0_2.4_1598343018425.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_pmc_base_cased_en_2.6.0_2.4_1598343018425.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("biobert_pmc_base_cased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("biobert_pmc_base_cased", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I hate cancer").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer"]
embeddings_df = nlu.load('en.embed.biobert.pmc_base_cased').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_biobert_pmc_base_cased_embeddings
I [0.0654267892241478, 0.06330983340740204, 0.13...
hate [0.3058323264122009, 0.4778319299221039, -0.09...
cancer [0.3130614757537842, 0.024675076827406883, -0....
```
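A common downstream use of these embeddings is comparing tokens with cosine similarity. A self-contained sketch (toy 3-dimensional vectors truncated from the results above; real BioBERT embeddings are 768-dimensional):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d stand-ins for the 768-d token embeddings shown above.
v_hate = [0.31, 0.48, -0.09]
v_cancer = [0.31, 0.02, -0.05]
print(round(cosine(v_hate, v_cancer), 3))
```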
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|biobert_pmc_base_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|en|
|Dimension:|768|
|Case sensitive:|true|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert)
---
layout: model
title: German asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728 TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728` is a German model originally trained by jonatasgrosman.
NOTE: This model only works on a CPU; if you need to run this model on a GPU device, please use asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728_de_4.2.0_3.0_1664115088643.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728_de_4.2.0_3.0_1664115088643.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728", "de")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728", "de")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|de|
|Size:|1.2 GB|
---
layout: model
title: Abkhazian asr_wav2vec2_common_voice_ab_demo TFWav2Vec2ForCTC from patrickvonplaten
author: John Snow Labs
name: asr_wav2vec2_common_voice_ab_demo
date: 2022-09-24
tags: [wav2vec2, ab, audio, open_source, asr]
task: Automatic Speech Recognition
language: ab
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_common_voice_ab_demo` is an Abkhazian model originally trained by patrickvonplaten.
NOTE: This model only works on a CPU; if you need to run this model on a GPU device, please use asr_wav2vec2_common_voice_ab_demo_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_common_voice_ab_demo_ab_4.2.0_3.0_1664042256123.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_common_voice_ab_demo_ab_4.2.0_3.0_1664042256123.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_common_voice_ab_demo", "ab")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_common_voice_ab_demo", "ab")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_common_voice_ab_demo|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|ab|
|Size:|1.2 GB|
---
layout: model
title: Mapping Entities with Corresponding RxNorm Codes and Normalized Names
author: John Snow Labs
name: rxnorm_normalized_mapper
date: 2022-09-29
tags: [en, clinical, licensed, rxnorm, chunk_mapping]
task: Chunk Mapping
language: en
nav_key: models
edition: Healthcare NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained model maps entities to their corresponding RxNorm codes and normalized RxNorm resolutions.
## Predicted Entities
`rxnorm_code`, `normalized_name`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_normalized_mapper_en_4.1.0_3.0_1664443862683.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_normalized_mapper_en_4.1.0_3.0_1664443862683.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
posology_ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("posology_ner")
posology_ner_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "posology_ner"])\
.setOutputCol("ner_chunk")
chunkerMapper = ChunkMapperModel.pretrained("rxnorm_normalized_mapper", "en", "clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")\
.setRels(["rxnorm_code", "normalized_name"])
mapper_pipeline = Pipeline().setStages([
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
posology_ner_model,
posology_ner_converter,
chunkerMapper])
data = spark.createDataFrame([["The patient was given Zyrtec 10 MG, Adapin 10 MG Oral Capsule, Septi-Soothe 0.5 Topical Spray"]]).toDF("text")
result= mapper_pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val posology_ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("posology_ner")
val posology_ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "posology_ner"))
.setOutputCol("ner_chunk")
val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_normalized_mapper", "en", "clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("mappings")
.setRels(Array("rxnorm_code", "normalized_name"))
val mapper_pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
posology_ner_model,
posology_ner_converter,
chunkerMapper))
val data = Seq("The patient was given Zyrtec 10 MG, Adapin 10 MG Oral Capsule, Septi-Soothe 0.5 Topical Spray").toDS.toDF("text")
val result = mapper_pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.rxnorm_normalized").predict("""The patient was given Zyrtec 10 MG, Adapin 10 MG Oral Capsule, Septi-Soothe 0.5 Topical Spray""")
```
## Results
```bash
+------------------------------+-----------+--------------------------------------------------------------+
|ner_chunk |rxnorm_code|normalized_name |
+------------------------------+-----------+--------------------------------------------------------------+
|Zyrtec 10 MG |1011483 |cetirizine hydrochloride 10 MG [Zyrtec] |
|Adapin 10 MG Oral Capsule |1000050 |doxepin hydrochloride 10 MG Oral Capsule [Adapin] |
|Septi-Soothe 0.5 Topical Spray|1000046 |chlorhexidine diacetate 0.5 MG/ML Topical Spray [Septi-Soothe]|
+------------------------------+-----------+--------------------------------------------------------------+
```
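Downstream, the `mappings` column can be flattened into one `(rxnorm_code, normalized_name)` record per chunk. The sketch below is a plain-Python illustration that assumes each relation set via `setRels` yields one annotation whose metadata carries `chunk` and `relation` keys; verify the exact schema with `LightPipeline.fullAnnotate` before relying on it.

```python
# Hypothetical post-processing of ChunkMapper output annotations.
# The metadata keys ("chunk", "relation") are assumptions, not taken
# from this page; inspect fullAnnotate output for the real schema.
def pair_relations(annotations):
    """Group mapper annotations by chunk index into {relation: value} dicts."""
    grouped = {}
    for ann in annotations:
        idx = ann["metadata"]["chunk"]
        grouped.setdefault(idx, {})[ann["metadata"]["relation"]] = ann["result"]
    return [grouped[k] for k in sorted(grouped)]

sample = [
    {"result": "1011483",
     "metadata": {"chunk": "0", "relation": "rxnorm_code"}},
    {"result": "cetirizine hydrochloride 10 MG [Zyrtec]",
     "metadata": {"chunk": "0", "relation": "normalized_name"}},
]
rows = pair_relations(sample)
print(rows[0]["rxnorm_code"])  # 1011483
```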
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|rxnorm_normalized_mapper|
|Compatibility:|Healthcare NLP 4.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|10.7 MB|
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from Slavka)
author: John Snow Labs
name: distilbert_qa_bert_base_cased_finetuned_log_parser_winlogbeat
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-finetuned-log-parser-winlogbeat` is an English model originally trained by `Slavka`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_bert_base_cased_finetuned_log_parser_winlogbeat_en_4.0.0_3.0_1654723355284.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_bert_base_cased_finetuned_log_parser_winlogbeat_en_4.0.0_3.0_1654723355284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bert_base_cased_finetuned_log_parser_winlogbeat","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bert_base_cased_finetuned_log_parser_winlogbeat","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.base_cased").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_bert_base_cased_finetuned_log_parser_winlogbeat|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Slavka/bert-base-cased-finetuned-log-parser-winlogbeat
---
layout: model
title: Summarize clinical notes
author: John Snow Labs
name: summarizer_clinical_jsl
date: 2023-03-25
tags: [en, licensed, clinical, summarization, tensorflow]
task: Summarization
language: en
edition: Healthcare NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalSummarizer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Summarize clinical notes, encounters, critical care notes, discharge notes, reports, etc.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/MEDICAL_TEXT_SUMMARIZATION/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/32.Medical_Text_Summarization.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_en_4.3.1_3.0_1679772340755.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_en_4.3.1_3.0_1679772340755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document = DocumentAssembler().setInputCol('text').setOutputCol('document')
summarizer = MedicalSummarizer.pretrained("summarizer_clinical_jsl", "en", "clinical/models").setInputCols(['document'])\
.setOutputCol('summary')\
.setMaxTextLength(512)\
.setMaxNewTokens(512)
pipeline = sparknlp.base.Pipeline(stages=[
document,
summarizer
])
text = """Patient with hypertension, syncope, and spinal stenosis - for recheck.
(Medical Transcription Sample Report)
SUBJECTIVE:
The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema.
PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS:
Reviewed and unchanged from the dictation on 12/03/2003.
MEDICATIONS:
Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash."""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val summarizer = MedicalSummarizer.pretrained("summarizer_clinical_jsl", "en", "clinical/models")
.setInputCols(Array("document"))
.setOutputCol("summary")
.setMaxTextLength(512)
.setMaxNewTokens(512)
val pipeline = new Pipeline().setStages(Array(document_assembler, summarizer))
val text = """Patient with hypertension, syncope, and spinal stenosis - for recheck.
(Medical Transcription Sample Report)
SUBJECTIVE:
The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema.
PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS:
Reviewed and unchanged from the dictation on 12/03/2003.
MEDICATIONS:
Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash."""
val data = Seq(text).toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
A 78-year-old female with hypertension, syncope, and spinal stenosis returns for recheck. She denies chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. She is on multiple medications and has Elocon cream and Synalar cream for rash.
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|summarizer_clinical_jsl|
|Compatibility:|Healthcare NLP 4.3.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|920.1 MB|
## References
Trained on in-house curated dataset
---
layout: model
title: Sentence Entity Resolver for billable ICD10-CM HCC codes (Slim, JSL Medium Bert)
author: John Snow Labs
name: sbertresolve_icd10cm_slim_billable_hcc_med
date: 2021-05-21
tags: [licensed, clinical, en, entity_resolution]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to ICD10-CM codes using sentence embeddings. It has been augmented with synonyms; synonyms with low cosine similarity to the original term were dropped, making the model slim. It uses the fine-tuned `sbert_jsl_medium_uncased` Sentence BERT model.
## Predicted Entities
Outputs 7-digit billable ICD codes. In the result, look for the `aux_label` parameter in the metadata to get the HCC status, which can be split into three parts: billable status, HCC status, and HCC score.
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_slim_billable_hcc_med_en_3.0.4_2.4_1621590174924.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_slim_billable_hcc_med_en_3.0.4_2.4_1621590174924.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbert_jsl_medium_uncased","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sbert_embeddings")
icd10_resolver = SentenceEntityResolverModel\
.pretrained("sbertresolve_icd10cm_slim_billable_hcc_med","en", "clinical/models")\
.setInputCols(["document", "sbert_embeddings"])\
.setOutputCol("icd10cm_code")\
.setDistanceFunction("EUCLIDEAN")\
.setReturnCosineDistances(True)
bert_pipeline_icd = Pipeline(stages = [document_assembler, sbert_embedder, icd10_resolver])
data = spark.createDataFrame([["metastatic lung cancer"]]).toDF("text")
results = bert_pipeline_icd.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbert_jsl_medium_uncased","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sbert_embeddings")
val icd10_resolver = SentenceEntityResolverModel
.pretrained("sbertresolve_icd10cm_slim_billable_hcc_med","en", "clinical/models")
.setInputCols(Array("document", "sbert_embeddings"))
.setOutputCol("icd10cm_code")
.setDistanceFunction("EUCLIDEAN")
.setReturnCosineDistances(true)
val bert_pipeline_icd = new Pipeline().setStages(Array(document_assembler, sbert_embedder, icd10_resolver))
val data = Seq("metastatic lung cancer").toDF("text")
val result = bert_pipeline_icd.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.icd10cm.slim_billable_hcc_med").predict("""metastatic lung cancer""")
```
## Results
```bash
| | chunks | code | resolutions | all_codes | billable_hcc_status_score | all_distances |
|---:|:-----------------------|:-------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------|:----------------------------|:-------------------------------------------------------------------------------------------------------------------------|
| 0 | metastatic lung cancer | C7800 | ['cancer metastatic to lung', 'metastasis from malignant tumor of lung', 'cancer metastatic to left lung', 'history of cancer metastatic to lung', 'metastatic cancer', 'history of cancer metastatic to lung (situation)', 'metastatic adenocarcinoma to bilateral lungs', 'cancer metastatic to chest wall', 'metastatic malignant neoplasm to left lower lobe of lung', 'metastatic carcinoid tumour', 'cancer metastatic to respiratory tract', 'metastatic carcinoid tumor'] | ['C7800', 'C349', 'C7801', 'Z858', 'C800', 'Z8511', 'C780', 'C798', 'C7802', 'C799', 'C7830', 'C7B00'] | ['1', '1', '8'] | ['0.0464', '0.0829', '0.0852', '0.0860', '0.0914', '0.0989', '0.1133', '0.1220', '0.1220', '0.1253', '0.1249', '0.1260'] |
```
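As noted above, the `billable_hcc_status_score` value packs three pieces of information. A minimal helper for unpacking it is sketched below; the `||` delimiter is an assumption inferred from how this aux label is commonly serialized, so confirm it against your actual resolver metadata.

```python
# Assumption: aux_label is serialized as "billable||hcc||score", e.g. "1||1||8".
def split_hcc_status(aux_label):
    """Split a packed HCC status string into its three components."""
    billable, hcc, score = aux_label.split("||")
    return {
        "billable": billable == "1",   # is the code billable?
        "hcc": hcc == "1",             # does it map to an HCC category?
        "hcc_score": int(score),       # the HCC risk score
    }

print(split_hcc_status("1||1||8"))
```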
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbertresolve_icd10cm_slim_billable_hcc_med|
|Compatibility:|Healthcare NLP 3.0.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk, sbert_embeddings]|
|Output Labels:|[icd10_code]|
|Language:|en|
|Case sensitive:|false|
---
layout: model
title: English DistilBertForQuestionAnswering model (from twmkn9)
author: John Snow Labs
name: distilbert_qa_base_uncased_squad2
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-squad2` is an English model originally trained by `twmkn9`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_en_4.0.0_3.0_1654727261779.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_en_4.0.0_3.0_1654727261779.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.distil_bert.base_uncased.by_twmkn9").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/twmkn9/distilbert-base-uncased-squad2
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_6_h_128
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-6_H-128` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_128_zh_4.2.4_3.0_1670325968748.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_128_zh_4.2.4_3.0_1670325968748.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_128","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_128","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_6_h_128|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|15.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-6_H-128
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://github.com/dbiir/UER-py/
- https://cloud.tencent.com/
---
layout: model
title: Legal Transition Services Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_transition_services_agreement_bert
date: 2022-11-25
tags: [en, legal, classification, agreement, transition_services, licensed, bert]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_transition_services_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `transition-services-agreement` or not (binary classification).
Compared with the Longformer model, this model is lighter in terms of inference time.
## Predicted Entities
`transition-services-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_transition_services_agreement_bert_en_1.0.0_3.0_1669372317483.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_transition_services_agreement_bert_en_1.0.0_3.0_1669372317483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
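{% include programmingLanguageSelectScalaPythonNLU.html %}

This page ships without a usage snippet; the sketch below mirrors the standard Legal NLP document-classification pipeline pattern. The sentence-embeddings model name (`sent_bert_base_cased`) and the `johnsnowlabs` import style are assumptions here, not taken from this page — verify both against the Models Hub before use.

```python
# Sketch only: a typical Legal NLP document-classifier pipeline.
# The embeddings model name below is an assumption; a licensed
# Legal NLP installation and Spark session are required to run this.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained(
        "legclf_transition_services_agreement_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```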
## Results
```bash
+-------------------------------+
|result                         |
+-------------------------------+
|[transition-services-agreement]|
|[other]                        |
|[other]                        |
|[transition-services-agreement]|
+-------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_transition_services_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house + SEC documents
## Benchmarking
```bash
label precision recall f1-score support
transition-services-agreement 0.94 0.87 0.90 38
other 0.93 0.97 0.95 65
accuracy - - 0.93 103
macro-avg 0.93 0.92 0.93 103
weighted-avg 0.93 0.93 0.93 103
```
---
layout: model
title: Stop Words Cleaner for Galician
author: John Snow Labs
name: stopwords_gl
date: 2020-07-14 19:03:00 +0800
task: Stop Words Removal
language: gl
edition: Spark NLP 2.5.4
spark_version: 2.4
tags: [stopwords, gl]
supported: true
annotator: StopWordsCleaner
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_gl_gl_2.5.4_2.4_1594742441210.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_gl_gl_2.5.4_2.4_1594742441210.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
stop_words = StopWordsCleaner.pretrained("stopwords_gl", "gl") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Ademais de ser o rei do norte, John Snow é un médico inglés e un líder no desenvolvemento da anestesia e a hixiene médica.")
```
```scala
...
val stopWords = StopWordsCleaner.pretrained("stopwords_gl", "gl")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords))
val data = Seq("Ademais de ser o rei do norte, John Snow é un médico inglés e un líder no desenvolvemento da anestesia e a hixiene médica.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Ademais de ser o rei do norte, John Snow é un médico inglés e un líder no desenvolvemento da anestesia e a hixiene médica."""]
stopword_df = nlu.load('gl.stopwords').predict(text)
stopword_df[["cleanTokens"]]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=6, result='Ademais', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=17, end=19, result='rei', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=24, end=28, result='norte', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=29, end=29, result=',', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=31, end=34, result='John', metadata={'sentence': '0'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_gl|
|Type:|stopwords|
|Compatibility:|Spark NLP 2.5.4+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|gl|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)
---
layout: model
title: Turkish BertForQuestionAnswering model (from yunusemreemik)
author: John Snow Labs
name: bert_qa_logo_qna_model
date: 2022-06-02
tags: [tr, open_source, question_answering, bert]
task: Question Answering
language: tr
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `logo-qna-model` is a Turkish model originally trained by `yunusemreemik`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_logo_qna_model_tr_4.0.0_3.0_1654188164539.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_logo_qna_model_tr_4.0.0_3.0_1654188164539.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_logo_qna_model","tr") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_logo_qna_model","tr")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("tr.answer_question.bert.by_yunusemreemik").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_logo_qna_model|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|tr|
|Size:|412.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/yunusemreemik/logo-qna-model
---
layout: model
title: Swedish asr_Wav2Vec2_large_xlsr_welsh TFWav2Vec2ForCTC from Srulikbdd
author: John Snow Labs
name: asr_Wav2Vec2_large_xlsr_welsh
date: 2022-09-25
tags: [wav2vec2, sv, audio, open_source, asr]
task: Automatic Speech Recognition
language: sv
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_Wav2Vec2_large_xlsr_welsh` is a Swedish model originally trained by Srulikbdd.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_Wav2Vec2_large_xlsr_welsh_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Wav2Vec2_large_xlsr_welsh_sv_4.2.0_3.0_1664115056212.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Wav2Vec2_large_xlsr_welsh_sv_4.2.0_3.0_1664115056212.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_Wav2Vec2_large_xlsr_welsh", "sv")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_Wav2Vec2_large_xlsr_welsh", "sv")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_Wav2Vec2_large_xlsr_welsh|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|sv|
|Size:|1.2 GB|
---
layout: model
title: Bert for Sequence Classification (Question vs Statement)
author: John Snow Labs
name: bert_sequence_classifier_question_statement
date: 2021-11-04
tags: [question, statement, en, open_source]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 3.3.2
spark_version: 3.0
supported: true
annotator: BertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Trained to classify sentences as either questions or statements.
This model was imported from Hugging Face (https://huggingface.co/shahrukhx01/question-vs-statement-classifier), and trained based on Haystack (https://github.com/deepset-ai/haystack/issues/611).
## Predicted Entities
`question`, `statement`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_question_statement_en_3.3.2_3.0_1636038134936.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_question_statement_en_3.3.2_3.0_1636038134936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
seq = BertForSequenceClassification.pretrained('bert_sequence_classifier_question_statement', 'en')\
.setInputCols(["token", "sentence"])\
.setOutputCol("label")\
.setCaseSensitive(True)
pipeline = Pipeline(stages = [
documentAssembler,
sentenceDetector,
tokenizer,
seq])
test_sentences = ["""What feature in your car did you not realize you had until someone else told you about it?
Years ago, my Dad bought me a cute little VW Beetle. The first day I had it, me and my BFF were sitting in my car looking at everything.
When we opened the center console, we had quite the scare. Inside was a hollowed out, plastic phallic looking object with tiny spikes on it.
My friend and I literally screamed in horror. It was clear to us that somehow someone left their “toy” in my new car! We were shook, as they say.
This was my car, I had to do something. So, I used a pen to pick up the nasty looking thing and threw it out.
We freaked out about how gross it was and then we forgot about it… until my Dad called me.
My Dad said: How’s the new car? Have you seen the flower holder in the center console?
To summarize, we thought a flower vase was an XXX item…
In our defense, this is a picture of a VW Beetle flower holder."""]
import pandas as pd
data=spark.createDataFrame(pd.DataFrame({'text': test_sentences}))
res = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val seq = BertForSequenceClassification.pretrained("bert_sequence_classifier_question_statement", "en")
.setInputCols(Array("token", "sentence"))
.setOutputCol("label")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
seq))
val test_sentences = """What feature in your car did you not realize you had until someone else told you about it?
Years ago, my Dad bought me a cute little VW Beetle. The first day I had it, me and my BFF were sitting in my car looking at everything.
When we opened the center console, we had quite the scare. Inside was a hollowed out, plastic phallic looking object with tiny spikes on it.
My friend and I literally screamed in horror. It was clear to us that somehow someone left their “toy” in my new car! We were shook, as they say.
This was my car, I had to do something. So, I used a pen to pick up the nasty looking thing and threw it out.
We freaked out about how gross it was and then we forgot about it… until my Dad called me.
My Dad said: How’s the new car? Have you seen the flower holder in the center console?
To summarize, we thought a flower vase was an XXX item…
In our defense, this is a picture of a VW Beetle flower holder."""
val example = Seq(test_sentences).toDF("text")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.question_vs_statement").predict("""What feature in your car did you not realize you had until someone else told you about it?
Years ago, my Dad bought me a cute little VW Beetle. The first day I had it, me and my BFF were sitting in my car looking at everything.
When we opened the center console, we had quite the scare. Inside was a hollowed out, plastic phallic looking object with tiny spikes on it.
My friend and I literally screamed in horror. It was clear to us that somehow someone left their “toy” in my new car! We were shook, as they say.
This was my car, I had to do something. So, I used a pen to pick up the nasty looking thing and threw it out.
We freaked out about how gross it was and then we forgot about it… until my Dad called me.
My Dad said: How’s the new car? Have you seen the flower holder in the center console?
To summarize, we thought a flower vase was an XXX item…
In our defense, this is a picture of a VW Beetle flower holder.""")
```
## Results
```bash
+------------------------------------------------------------------------------------------+---------+
| sentence| label|
+------------------------------------------------------------------------------------------+---------+
|What feature in your car did you not realize you had until someone else told you about it?| question|
| Years ago, my Dad bought me a cute little VW Beetle.|statement|
| The first day I had it, me and my BFF were sitting in my car looking at everything.|statement|
| When we opened the center console, we had quite the scare.|statement|
| Inside was a hollowed out, plastic phallic looking object with tiny spikes on it.|statement|
| My friend and I literally screamed in horror.|statement|
| It was clear to us that somehow someone left their “toy” in my new car!|statement|
| We were shook, as they say.|statement|
| This was my car, I had to do something.|statement|
| So, I used a pen to pick up the nasty looking thing and threw it out.|statement|
|We freaked out about how gross it was and then we forgot about it… until my Dad called me.|statement|
| My Dad said: How’s the new car?| question|
| Have you seen the flower holder in the center console?| question|
| To summarize, we thought a flower vase was an XXX item…|statement|
| In our defense, this is a picture of a VW Beetle flower holder.|statement|
+------------------------------------------------------------------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_question_statement|
|Compatibility:|Spark NLP 3.3.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[label]|
|Language:|en|
|Case sensitive:|true|
## Data Source
https://github.com/deepset-ai/haystack/issues/611
## Benchmarking
```bash
Extracted from https://github.com/deepset-ai/haystack/issues/611
precision recall f1-score support
statement 0.94 0.94 0.94 16105
question 0.96 0.96 0.96 26198
accuracy 0.95 42303
macro avg 0.95 0.95 0.95 42303
weighted avg 0.95 0.95 0.95 42303
```
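As a quick sanity check, the macro average in the table above is the unweighted mean of the per-class F1 scores, while the weighted average weights each class by its support. The snippet below (plain Python, independent of Spark NLP) reproduces both figures from the reported per-class values:

```python
# Per-class F1 scores and supports reported in the benchmark above.
f1 = {"statement": 0.94, "question": 0.96}
support = {"statement": 16105, "question": 26198}

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class weighted by its support.
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

# Both round to 0.95, matching the table.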
---
layout: model
title: Legal Court Judgment Prediction (Portuguese)
author: John Snow Labs
name: legclf_judgment_prediction
date: 2023-04-06
tags: [pt, licensed, legal, classification, tensorflow]
task: Text Classification
language: pt
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a multiclass classification model that predicts court decisions in the State Supreme Court, using the following classes:
- no: The appeal was denied
- partial: For partially favourable decisions
- yes: For fully favourable decisions
## Predicted Entities
`no`, `partial`, `yes`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_judgment_prediction_pt_1.0.0_3.0_1680778981035.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_judgment_prediction_pt_1.0.0_3.0_1680778981035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
seq_classifier = legal.BertForSequenceClassification.pretrained("legclf_judgment_prediction", "pt", "legal/models") \
.setInputCols(["document", "token"]) \
.setOutputCol("class")
pipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer ,
seq_classifier
])
# simple examples
example = spark.createDataFrame([["PENAL. PROCESSO PENAL. APELAÇÃO. HOMICÍDIO QUALIFICADO. ARGUIÇÃO DE NULIDADE EM DECORRÊNCIA DA INSCONSTITUCIONALIDADE DO ARTIGO 457 DO CÓDIGO DE PROCESSO PENAL. AFASTADA. PLEITO DE REDIMENSIONAMENTO DA PENA. DOSIMETRIA QUE MERECE RETOQUES. AFASTADA A VALORAÇÃO DESFAVORÁVEL DAS CIRCUNSTÂNCIAS JUDICIAIS DOS ANTECEDENTES E DA PERSONALIDADE DO AGENTE. MANTIDA A CULPABILIDADE, CIRCUNSTÂNCIAS DO DELITO E CONSEQUÊNCIAS DO CRIME. APELO CONHECIDO E PARCIALMENTE PROVIDO. 1 Não há falar em ocorrência de nulidade não caso concreto, não existindo qualquer inconstitucionalidade em virtude do texto legal do ARTIGO 457 do Código de Processo Penal, não tendo ocorrido o adiamento da sessão do júri em virtude da ausência do acusado, conforme alegado. Pelo contrário, este foi devidamente intimado por edital e, mesmo assim, restou ausente. 2 A justificativa apresentada pelo magistrado singular acerca da culpabilidade considerou a alta reprovabilidade da conduta do réu, em virtude da premeditação e frieza na prática delitiva, considerando que o acusado foi até a casa da vítima com o intuito de ceifar a sua vida, experimentando assim a consequência da transgressão, estando acertada a valoração negativa desta circunstância judicial."]]).toDF("text")
result = pipeline.fit(example).transform(example)
# result is a DataFrame
result.select("text", "class.result").show(truncate=100)
```
## Results
```bash
+----------------------------------------------------------------------------------------------------+---------+
| text| result|
+----------------------------------------------------------------------------------------------------+---------+
|PENAL. PROCESSO PENAL. APELAÇÃO. HOMICÍDIO QUALIFICADO. ARGUIÇÃO DE NULIDADE EM DECORRÊNCIA DA IN...|[partial]|
+----------------------------------------------------------------------------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_judgment_prediction|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|pt|
|Size:|408.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## Benchmarking
```bash
label precision recall f1-score support
no 0.76 0.77 0.76 86
partial 0.79 0.71 0.75 75
yes 0.71 0.78 0.74 76
accuracy - - 0.75 237
macro-avg 0.75 0.75 0.75 237
weighted-avg 0.75 0.75 0.75 237
```
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_42
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-128-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_42_en_4.0.0_3.0_1657184284783.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_42_en_4.0.0_3.0_1657184284783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_42","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_42","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_42|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-128-finetuned-squad-seed-42
---
layout: model
title: English image_classifier_vit__spectrogram ViTForImageClassification from prashanth0205
author: John Snow Labs
name: image_classifier_vit__spectrogram
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit__spectrogram` is an English model originally trained by prashanth0205.
## Predicted Entities
`female`, `male`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit__spectrogram_en_4.1.0_3.0_1660170988681.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit__spectrogram_en_4.1.0_3.0_1660170988681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit__spectrogram", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit__spectrogram", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit__spectrogram|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: English BertForQuestionAnswering model (from tli8hf)
author: John Snow Labs
name: bert_qa_unqover_bert_base_uncased_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unqover-bert-base-uncased-squad` is an English model originally trained by `tli8hf`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_unqover_bert_base_uncased_squad_en_4.0.0_3.0_1654192543690.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_unqover_bert_base_uncased_squad_en_4.0.0_3.0_1654192543690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_unqover_bert_base_uncased_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_unqover_bert_base_uncased_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base_uncased.by_tli8hf").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
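The nlu snippet above packs the question and its context into a single string joined by a `|||` separator. As an illustrative sketch only (assuming nlu splits the payload on `|||`; `split_qa` is a hypothetical helper, not part of the nlu API), the convention can be unpacked like this:

```python
# Illustrative sketch of the "question|||context" payload convention
# used by nlu's QA predict call above; split_qa is a hypothetical helper.
def split_qa(payload: str):
    """Split a 'question|||context' payload into its two parts."""
    question, _, context = payload.partition("|||")
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")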
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_unqover_bert_base_uncased_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/tli8hf/unqover-bert-base-uncased-squad
---
layout: model
title: Pipeline to Detect Units and Measurements
author: John Snow Labs
name: ner_measurements_clinical_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, measurements, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_measurements_clinical](https://nlp.johnsnowlabs.com/2021/04/01/ner_measurements_clinical_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_measurements_clinical_pipeline_en_3.4.1_3.0_1647870532389.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_measurements_clinical_pipeline_en_3.4.1_3.0_1647870532389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_measurements_clinical_pipeline", "en", "clinical/models")
pipeline.annotate("EXAMPLE MEDICAL TEXT")
```
```scala
val pipeline = new PretrainedPipeline("ner_measurements_clinical_pipeline", "en", "clinical/models")
pipeline.annotate("EXAMPLE MEDICAL TEXT")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.clinical_measurements.pipeline").predict("""EXAMPLE MEDICAL TEXT""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_measurements_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: English asr_wav2vec2_cetuc_sid_voxforge_mls_0 TFWav2Vec2ForCTC from joaoalvarenga
author: John Snow Labs
name: asr_wav2vec2_cetuc_sid_voxforge_mls_0
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_cetuc_sid_voxforge_mls_0` is an English model originally trained by joaoalvarenga.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_cetuc_sid_voxforge_mls_0_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_cetuc_sid_voxforge_mls_0_en_4.2.0_3.0_1664022744764.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_cetuc_sid_voxforge_mls_0_en_4.2.0_3.0_1664022744764.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_cetuc_sid_voxforge_mls_0", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_cetuc_sid_voxforge_mls_0", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_cetuc_sid_voxforge_mls_0|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: SDOH Insurance Type For Classification
author: John Snow Labs
name: genericclassifier_sdoh_insurance_type_sbiobert_cased_mli
date: 2023-04-28
tags: [en, insurance, sdoh, social_determinants, public_health, classification, licensed]
task: Text Classification
language: en
edition: Healthcare NLP 4.4.0
spark_version: 3.0
supported: true
annotator: GenericClassifierModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Generic Classifier model detects the type of insurance a patient has; it assumes the patient **has insurance**.
If the patient's insurance type is not mentioned or not known, it is labeled "Other". If the patient's insurance is "Tricare" or "VA (Veterans Affairs)", it is labeled "Military". The model was trained using the GenericClassifierApproach annotator.
`Employer`: Employer insurance.
`Medicaid`: Medicaid insurance.
`Medicare`: Medicare insurance.
`Military`: "Tricare" or "VA (Veterans Affairs)" insurance.
`Private`: Private insurance.
`Other`: Other insurance, or the insurance type is not mentioned in the clinical notes or is not known.
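The labeling convention above can be sketched as a simple keyword lookup. This is an illustrative approximation only; the actual model is a trained classifier, and `label_insurance` is a hypothetical helper, not part of the model or Spark NLP:

```python
import re

# Illustrative approximation of the labeling convention described above;
# the real model is a trained classifier, not a keyword lookup.
def label_insurance(mention):
    """Map a raw insurance mention to one of the six labels."""
    if not mention:
        return "Other"  # not mentioned / not known
    tokens = set(re.findall(r"[a-z]+", mention.lower()))
    if tokens & {"tricare", "va", "veterans"}:
        return "Military"  # Tricare or VA (Veterans Affairs)
    for label in ("Employer", "Medicaid", "Medicare", "Private"):
        if label.lower() in tokens:
            return label
    return "Other"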
## Predicted Entities
`Employer`, `Medicaid`, `Medicare`, `Military`, `Private`, `Other`
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/social_determinant){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_insurance_type_sbiobert_cased_mli_en_4.4.0_3.0_1682694596560.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_insurance_type_sbiobert_cased_mli_en_4.4.0_3.0_1682694596560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from pyspark.sql.types import StringType

document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
features_asm = FeaturesAssembler()\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("features")
generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_insurance_type_sbiobert_cased_mli", 'en', 'clinical/models')\
.setInputCols(["features"])\
.setOutputCol("prediction")
pipeline = Pipeline(stages=[
document_assembler,
sentence_embeddings,
features_asm,
generic_classifier
])
text_list = [
"The patient has VA insurance.",
"She is under Medicare insurance",
"The patient has good coverage of Private insurance",
"""Medical File for John Smith, Male, Age 42
Chief Complaint: Patient complains of nausea, vomiting, and shortness of breath.
History of Present Illness: The patient has a history of hypertension and diabetes, which are both poorly controlled. The patient has been feeling unwell for the past week, with symptoms including nausea, vomiting, and shortness of breath. Upon examination, the patient was found to have a high serum creatinine level of 5.8 mg/dL, indicating renal failure.
Past Medical History: The patient has a history of hypertension and diabetes, which have been poorly controlled due to poor medication adherence. The patient also has a history of smoking, which has been a contributing factor to the development of renal failure.
Medications: The patient is currently taking Metformin and Lisinopril for the management of diabetes and hypertension, respectively. However, due to poor Medicaid coverage, the patient is unable to afford some of the medications prescribed by his physician.
Insurance Status: The patient has Medicaid insurance, which provides poor coverage for some of the medications needed to manage his medical conditions, including those related to his renal failure.
Physical Examination: Upon physical examination, the patient appears pale and lethargic. Blood pressure is 160/100 mmHg, heart rate is 90 beats per minute, and respiratory rate is 20 breaths per minute. There is diffuse abdominal tenderness on palpation, and lung auscultation reveals bilateral rales.
Diagnosis: The patient is diagnosed with acute renal failure, likely due to uncontrolled hypertension and poorly managed diabetes.
Treatment: The patient is started on intravenous fluids and insulin to manage his blood sugar levels. Due to the patient's poor Medicaid coverage, the physician works with the patient to identify alternative medications that are more affordable and will still provide effective management of his medical conditions.
Follow-Up: The patient is advised to follow up with his primary care physician for ongoing management of his renal failure and other medical conditions. The patient is also referred to a nephrologist for further evaluation and management of his renal failure.
""",
"""Certainly, here is an example case study for a patient with private insurance:
Case Study for Emily Chen, Female, Age 38
Chief Complaint: Patient reports chronic joint pain and stiffness.
History of Present Illness: The patient has been experiencing chronic joint pain and stiffness, particularly in the hands, knees, and ankles. The pain is worse in the morning and improves throughout the day. The patient has also noticed some swelling and redness in the affected joints.
Past Medical History: The patient has a history of osteoarthritis, which has been gradually worsening over the past several years. The patient has tried over-the-counter pain relievers and joint supplements, but has not found significant relief.
Medications: The patient is currently taking over-the-counter pain relievers and joint supplements for the management of her osteoarthritis.
Insurance Status: The patient has private insurance, which provides comprehensive coverage for her medical care, including specialist visits and prescription medications.
Physical Examination: Upon physical examination, the patient has tenderness and swelling in multiple joints, particularly the hands, knees, and ankles. Range of motion is limited due to pain and stiffness.
Diagnosis: The patient is diagnosed with osteoarthritis, a chronic degenerative joint disease that causes pain, swelling, and stiffness in the affected joints.
Treatment: The patient is prescribed a nonsteroidal anti-inflammatory drug (NSAID) to manage pain and inflammation. The physician also recommends physical therapy to improve range of motion and strengthen the affected joints. The patient is advised to continue taking joint supplements for ongoing joint health.
Follow-Up: The patient is advised to follow up with the physician in 4-6 weeks to monitor response to treatment and make any necessary adjustments. The patient is also referred to a rheumatologist for further evaluation and management of her osteoarthritis.""",
"""
Medical File for John Doe, Male, Age 72
Chief Complaint: Patient reports shortness of breath and fatigue.
History of Present Illness: The patient has been experiencing shortness of breath and fatigue for the past several weeks. The patient reports difficulty performing daily activities and has noticed a decrease in exercise tolerance.
Past Medical History: The patient has a history of hypertension, hyperlipidemia, and coronary artery disease. The patient has undergone a coronary artery bypass graft (CABG) surgery in the past.
Medications: The patient is currently taking several medications, including a beta blocker, a statin, and a diuretic, for the management of his medical conditions.
Insurance Status: The patient has good coverage of Medicare insurance, which provides comprehensive coverage for his medical care, including specialist visits, diagnostic tests, and prescription medications.
Physical Examination: Upon physical examination, the patient has crackles in the lungs and peripheral edema. Blood pressure is elevated, and heart sounds are irregular.
Diagnosis: The patient is diagnosed with congestive heart failure, a chronic condition in which the heart cannot pump blood effectively to meet the body's needs.
Treatment: The patient is admitted to the hospital for further evaluation and management of his congestive heart failure. Treatment includes diuresis to remove excess fluid, medication management to control blood pressure and heart rate, and oxygen therapy to improve breathing. The patient is also advised to follow a low-sodium diet and to monitor his fluid intake closely.
Follow-Up: The patient is advised to follow up with his primary care physician and cardiologist regularly to monitor his heart function and adjust treatment as necessary. The patient is also referred to cardiac rehabilitation to improve his exercise tolerance and overall cardiovascular health."""]
df = spark.createDataFrame(text_list, StringType()).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("text", "prediction.result").show(truncate=100)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val features_asm = new FeaturesAssembler()
.setInputCols("sentence_embeddings")
.setOutputCol("features")
val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_insurance_type_sbiobert_cased_mli", "en", "clinical/models")
.setInputCols("features")
.setOutputCol("prediction")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_embeddings,
features_asm,
generic_classifier))
val data = Seq(Array(
"The patient has VA insurance.",
"She is under Medicare insurance",
"The patient has good coverage of Private insurance",
"""Medical File for John Smith, Male, Age 42
Chief Complaint: Patient complains of nausea, vomiting, and shortness of breath.
History of Present Illness: The patient has a history of hypertension and diabetes, which are both poorly controlled. The patient has been feeling unwell for the past week, with symptoms including nausea, vomiting, and shortness of breath. Upon examination, the patient was found to have a high serum creatinine level of 5.8 mg/dL, indicating renal failure.
Past Medical History: The patient has a history of hypertension and diabetes, which have been poorly controlled due to poor medication adherence. The patient also has a history of smoking, which has been a contributing factor to the development of renal failure.
Medications: The patient is currently taking Metformin and Lisinopril for the management of diabetes and hypertension, respectively. However, due to poor Medicaid coverage, the patient is unable to afford some of the medications prescribed by his physician.
Insurance Status: The patient has Medicaid insurance, which provides poor coverage for some of the medications needed to manage his medical conditions, including those related to his renal failure.
Physical Examination: Upon physical examination, the patient appears pale and lethargic. Blood pressure is 160/100 mmHg, heart rate is 90 beats per minute, and respiratory rate is 20 breaths per minute. There is diffuse abdominal tenderness on palpation, and lung auscultation reveals bilateral rales.
Diagnosis: The patient is diagnosed with acute renal failure, likely due to uncontrolled hypertension and poorly managed diabetes.
Treatment: The patient is started on intravenous fluids and insulin to manage his blood sugar levels. Due to the patient's poor Medicaid coverage, the physician works with the patient to identify alternative medications that are more affordable and will still provide effective management of his medical conditions.
Follow-Up: The patient is advised to follow up with his primary care physician for ongoing management of his renal failure and other medical conditions. The patient is also referred to a nephrologist for further evaluation and management of his renal failure.
""",
"""Certainly, here is an example case study for a patient with private insurance:
Case Study for Emily Chen, Female, Age 38
Chief Complaint: Patient reports chronic joint pain and stiffness.
History of Present Illness: The patient has been experiencing chronic joint pain and stiffness, particularly in the hands, knees, and ankles. The pain is worse in the morning and improves throughout the day. The patient has also noticed some swelling and redness in the affected joints.
Past Medical History: The patient has a history of osteoarthritis, which has been gradually worsening over the past several years. The patient has tried over-the-counter pain relievers and joint supplements, but has not found significant relief.
Medications: The patient is currently taking over-the-counter pain relievers and joint supplements for the management of her osteoarthritis.
Insurance Status: The patient has private insurance, which provides comprehensive coverage for her medical care, including specialist visits and prescription medications.
Physical Examination: Upon physical examination, the patient has tenderness and swelling in multiple joints, particularly the hands, knees, and ankles. Range of motion is limited due to pain and stiffness.
Diagnosis: The patient is diagnosed with osteoarthritis, a chronic degenerative joint disease that causes pain, swelling, and stiffness in the affected joints.
Treatment: The patient is prescribed a nonsteroidal anti-inflammatory drug (NSAID) to manage pain and inflammation. The physician also recommends physical therapy to improve range of motion and strengthen the affected joints. The patient is advised to continue taking joint supplements for ongoing joint health.
Follow-Up: The patient is advised to follow up with the physician in 4-6 weeks to monitor response to treatment and make any necessary adjustments. The patient is also referred to a rheumatologist for further evaluation and management of her osteoarthritis.""",
"""
Medical File for John Doe, Male, Age 72
Chief Complaint: Patient reports shortness of breath and fatigue.
History of Present Illness: The patient has been experiencing shortness of breath and fatigue for the past several weeks. The patient reports difficulty performing daily activities and has noticed a decrease in exercise tolerance.
Past Medical History: The patient has a history of hypertension, hyperlipidemia, and coronary artery disease. The patient has undergone a coronary artery bypass graft (CABG) surgery in the past.
Medications: The patient is currently taking several medications, including a beta blocker, a statin, and a diuretic, for the management of his medical conditions.
Insurance Status: The patient has good coverage of Medicare insurance, which provides comprehensive coverage for his medical care, including specialist visits, diagnostic tests, and prescription medications.
Physical Examination: Upon physical examination, the patient has crackles in the lungs and peripheral edema. Blood pressure is elevated, and heart sounds are irregular.
Diagnosis: The patient is diagnosed with congestive heart failure, a chronic condition in which the heart cannot pump blood effectively to meet the body's needs.
Treatment: The patient is admitted to the hospital for further evaluation and management of his congestive heart failure. Treatment includes diuresis to remove excess fluid, medication management to control blood pressure and heart rate, and oxygen therapy to improve breathing. The patient is also advised to follow a low-sodium diet and to monitor his fluid intake closely.
Follow-Up: The patient is advised to follow up with his primary care physician and cardiologist regularly to monitor his heart function and adjust treatment as necessary. The patient is also referred to cardiac rehabilitation to improve his exercise tolerance and overall cardiovascular health.""")).toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+----------------------------------------------------------------------------------------------------+----------+
| text| result|
+----------------------------------------------------------------------------------------------------+----------+
| The patient has VA insurance.|[Military]|
| She is under Medicare insurance|[Medicare]|
|Medical File for John Smith, Male, Age 42\n\nChief Complaint: Patient complains of nausea, vomiti...|[Medicaid]|
|Certainly, here is an example case study for a patient with private insurance:\n\nCase Study for ...| [Private]|
|\nMedical File for John Doe, Male, Age 72\n\nChief Complaint: Patient reports shortness of breath...|[Medicare]|
+----------------------------------------------------------------------------------------------------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|genericclassifier_sdoh_insurance_type_sbiobert_cased_mli|
|Compatibility:|Healthcare NLP 4.4.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[features]|
|Output Labels:|[prediction]|
|Language:|en|
|Size:|3.4 MB|
|Dependencies:|sbiobert_base_cased_mli|
## References
Internal SDOH project
## Benchmarking
```bash
label precision recall f1-score support
Employer 0.67 0.82 0.74 17
Medicaid 0.89 0.80 0.84 61
Medicare 0.85 0.89 0.87 38
Military 0.76 0.89 0.82 18
Other 0.56 0.45 0.50 11
Private 0.80 0.77 0.79 31
accuracy - - 0.81 176
macro-avg 0.75 0.77 0.76 176
weighted-avg 0.81 0.81 0.81 176
```
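The macro and weighted averages in the table can be recomputed from the per-class rows. A quick sanity check (small discrepancies come from the table's two-decimal rounding):

```python
# Per-class F1 and support copied from the benchmarking table above.
f1 = {"Employer": 0.74, "Medicaid": 0.84, "Medicare": 0.87,
      "Military": 0.82, "Other": 0.50, "Private": 0.79}
support = {"Employer": 17, "Medicaid": 61, "Medicare": 38,
           "Military": 18, "Other": 11, "Private": 31}

total = sum(support.values())  # 176, matching the support column
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total
macro_f1 = sum(f1.values()) / len(f1)

print(round(weighted_f1, 2))  # ≈ 0.80; the table's 0.81 reflects per-class rounding
print(round(macro_f1, 2))     # 0.76
```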
---
layout: model
title: Translate Wallisian to English Pipeline
author: John Snow Labs
name: translate_wls_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, wls, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `wls`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_wls_en_xx_2.7.0_2.4_1609688000528.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_wls_en_xx_2.7.0_2.4_1609688000528.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_wls_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_wls_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.wls.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_wls_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: BERT Embeddings (Base Cased)
author: John Snow Labs
name: bert_base_cased
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model contains a deep bidirectional transformer trained on Wikipedia and the BookCorpus. The details are described in the paper "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)".
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_cased_en_2.6.0_2.4_1598340336670.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_cased_en_2.6.0_2.4_1598340336670.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("bert_base_cased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("bert_base_cased", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.bert.base_cased').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_bert_base_cased_embeddings
I [0.43879568576812744, -0.40160006284713745, 0....
love [0.21737590432167053, -0.3865768313407898, -0....
NLP [-0.16226479411125183, -0.053727392107248306, ...
```
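The 768-dimensional vectors shown above are typically consumed downstream by comparing them, most commonly with cosine similarity. A minimal NumPy sketch; the short toy vectors are purely illustrative stand-ins, not the model's actual (truncated) outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors standing in for the 768-dimensional
# token embeddings in the results table (real values are truncated there).
v_love = [0.217, -0.386, 0.051, 0.120]
v_nlp = [-0.162, -0.054, 0.033, 0.201]
print(cosine_similarity(v_love, v_nlp))
```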
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_base_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|768|
|Case sensitive:|true|
{:.h2_title}
## Data Source
The model is imported from [https://tfhub.dev/google/bert_cased_L-12_H-768_A-12/1](https://tfhub.dev/google/bert_cased_L-12_H-768_A-12/1)
---
layout: model
title: Multilingual BERT Embeddings (Base Cased)
author: John Snow Labs
name: bert_multi_cased
date: 2020-08-25
task: Embeddings
language: xx
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, xx]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model contains a deep bidirectional transformer trained on Wikipedia and the BookCorpus. The details are described in the paper "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)".
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_multi_cased_xx_2.6.0_2.4_1598341875191.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_multi_cased_xx_2.6.0_2.4_1598341875191.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("bert_multi_cased", "xx") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love Spark NLP']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("bert_multi_cased", "xx")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love Spark NLP"]
embeddings_df = nlu.load('xx.embed.bert_multi_cased').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
xx_embed_bert_multi_cased_embeddings token
[0.31631314754486084, -0.5579454898834229, 0.1... I
[-0.1488783359527588, -0.27264419198036194, -0... love
[0.0496230386197567, -0.43625175952911377, -0.... Spark
[-0.2838578224182129, -0.7103433012962341, 0.4... NLP
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_multi_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[xx]|
|Dimension:|768|
|Case sensitive:|true|
{:.h2_title}
## Data Source
The model is imported from [https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3](https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3)
---
layout: model
title: Korean BertForQuestionAnswering model (from bespin-global)
author: John Snow Labs
name: bert_qa_klue_bert_base_aihub_mrc
date: 2022-06-02
tags: [ko, open_source, question_answering, bert]
task: Question Answering
language: ko
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `klue-bert-base-aihub-mrc` is a Korean model originally trained by `bespin-global`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_klue_bert_base_aihub_mrc_ko_4.0.0_3.0_1654188035750.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_klue_bert_base_aihub_mrc_ko_4.0.0_3.0_1654188035750.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_klue_bert_base_aihub_mrc","ko") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_klue_bert_base_aihub_mrc","ko")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("ko.answer_question.klue.bert.base_aihub.by_bespin-global").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_klue_bert_base_aihub_mrc|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|ko|
|Size:|413.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/bespin-global/klue-bert-base-aihub-mrc
- https://github.com/KLUE-benchmark/KLUE
- https://www.bespinglobal.com/
- https://aihub.or.kr/aidata/86
---
layout: model
title: Relation Extraction between Tests, Results, and Dates
author: John Snow Labs
name: re_test_result_date
date: 2021-02-24
tags: [licensed, en, clinical, relation_extraction]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 2.7.4
spark_version: 2.4
supported: true
annotator: RelationExtractionModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Relation extraction between lab test names and their findings, measurements, results, and dates.
## Predicted Entities
`is_finding_of`, `is_result_of`, `is_date_of`, `O`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL_DATE/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb#scrollTo=D8TtVuN-Ee8s){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_test_result_date_en_2.7.4_2.4_1614168615976.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_test_result_date_en_2.7.4_2.4_1614168615976.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, PerceptronModel, DependencyParserModel, WordEmbeddingsModel, NerDLModel, NerConverter, RelationExtractionModel.
The table below lists the `re_test_result_date` RE model, its labels, the optimal NER model, and the meaningful relation pairs.
| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:-------------------:|:------------------------------------------------------:|:---------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| re_test_result_date | is_finding_of, is_result_of, is_date_of, O | ner_jsl | ["test-test_result", "test_result-test", "test-date", "date-test", "test-imagingfindings", "imagingfindings-test", "test-ekg_findings", "ekg_findings-test", "date-test_result", "test_result-date", "date-imagingfindings", "imagingfindings-date", "date-ekg_findings", "ekg_findings-date"] |
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")
pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
ner_tagger = MedicalNerModel().pretrained('jsl_ner_wip_clinical',"en","clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_chunker = NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ner_tags"])\
.setOutputCol("ner_chunks")
dependency_parser = DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentences", "pos_tags", "tokens"])\
.setOutputCol("dependencies")
re_model = RelationExtractionModel().pretrained("re_test_result_date", "en", 'clinical/models')\
.setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
.setOutputCol("relations")\
.setMaxSyntacticDistance(4)\
.setPredictionThreshold(0.9)\
.setRelationPairs(["external_body_part_or_region-test"])  # Possible relation pairs; default: all relations.
nlp_pipeline = Pipeline(stages=[document_assembler, sentencer, tokenizer, word_embeddings, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("""He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%""")
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val ner_tagger = MedicalNerModel.pretrained("jsl_ner_wip_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_chunker = new NerConverterInternal()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
val re_model = RelationExtractionModel.pretrained("re_test_result_date", "en", "clinical/models")
.setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies"))
.setOutputCol("relations")
.setMaxSyntacticDistance(4)
.setPredictionThreshold(0.9)
.setRelationPairs(Array("external_body_part_or_region-test")) // Possible relation pairs; default: all relations.
val nlp_pipeline = new Pipeline().setStages(Array(document_assembler, sentencer, tokenizer, word_embeddings, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model))
val text = """He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%"""
val data = Seq(text).toDS.toDF("text")
val results = nlp_pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.test_result_date").predict("""He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%""")
```
## Results
```bash
| index | relations | entity1 | chunk1 | entity2 | chunk2 |
|-------|--------------|--------------|---------------------|--------------|---------|
| 0 | O | TEST | chest X-ray | MEASUREMENTS | 93% |
| 1 | O | TEST | CT scan | MEASUREMENTS | 93% |
| 2 | is_result_of | TEST | SpO2 | MEASUREMENTS | 93% |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|re_test_result_date|
|Type:|re|
|Compatibility:|Healthcare NLP 2.7.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]|
|Output Labels:|[relations]|
|Language:|en|
## Data Source
Trained on internal data.
## Benchmarking
```bash
| relation | prec |
|-----------------|------|
| O | 0.77 |
| is_finding_of | 0.80 |
| is_result_of | 0.96 |
| is_date_of | 0.94 |
```
---
layout: model
title: English image_classifier_vit_base_mri ViTForImageClassification from raedinkhaled
author: John Snow Labs
name: image_classifier_vit_base_mri
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_mri` is an English model originally trained by raedinkhaled.
## Predicted Entities
`cad`, `healthy`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_mri_en_4.1.0_3.0_1660168752129.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_mri_en_4.1.0_3.0_1660168752129.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_base_mri", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_base_mri", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_base_mri|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: English DistilBertForQuestionAnswering model (from sunitha)
author: John Snow Labs
name: distilbert_qa_base_uncased_3feb_2022_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-3feb-2022-finetuned-squad` is an English model originally trained by `sunitha`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_3feb_2022_finetuned_squad_en_4.0.0_3.0_1654723730880.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_3feb_2022_finetuned_squad_en_4.0.0_3.0_1654723730880.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_3feb_2022_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_3feb_2022_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_sunitha").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_3feb_2022_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/sunitha/distilbert-base-uncased-3feb-2022-finetuned-squad
---
layout: model
title: Medical Spell Checker Pipeline
author: John Snow Labs
name: spellcheck_clinical_pipeline
date: 2022-04-14
tags: [spellcheck, medical, medical_spell_checker, spell_corrector, spell_pipeline, en, licensed, clinical]
task: Spell Check
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 2.4
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained medical spellchecker pipeline is built on top of the [spellcheck_clinical](https://nlp.johnsnowlabs.com/2022/04/14/spellcheck_clinical_en_2_4.html) model. This pipeline is intended for PySpark 2.4.x users with Spark NLP 3.4.1.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_pipeline_en_3.4.1_2.4_1649930943224.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_pipeline_en_3.4.1_2.4_1649930943224.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("spellcheck_clinical_pipeline", "en", "clinical/models")
example = ["Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.",
"With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.",
"Abdomen is sort, nontender, and nonintended.",
"Patient not showing pain or any wealth problems.",
"No cute distress"]
pipeline.fullAnnotate(example)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("spellcheck_clinical_pipeline", "en", "clinical/models")
val example = Array("Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.",
"With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.",
"Abdomen is sort, nontender, and nonintended.",
"Patient not showing pain or any wealth problems.",
"No cute distress")
pipeline.fullAnnotate(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.spell.clinical.pipeline").predict("""Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.""")
```
## Results
```bash
[{'checked': ['With','the','cell','of','physical','therapy','the','patient','was','ambulated','and','on','postoperative',',','the','patient','tolerating','a','post','surgical','soft','diet','.'],
'document': ['Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.'],
'token': ['Witth','the','hell','of','phisical','terapy','the','patient','was','imbulated','and','on','postoperative',',','the','impatient','tolerating','a','post','curgical','soft','diet','.']},
{'checked': ['With','pain','well','controlled','on','oral','pain','medications',',','she','was','discharged','to','rehabilitation','facility','.'],
'document': ['With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.'],
'token': ['With','paint','wel','controlled','on','orall','pain','medications',',','she','was','discharged','too','reihabilitation','facilitay','.']},
{'checked': ['Abdomen','is','soft',',','nontender',',','and','nondistended','.'],
'document': ['Abdomen is sort, nontender, and nonintended.'],
'token': ['Abdomen','is','sort',',','nontender',',','and','nonintended','.']},
{'checked': ['Patient','not','showing','pain','or','any','health','problems','.'],
'document': ['Patient not showing pain or any wealth problems.'],
'token': ['Patient','not','showing','pain','or','any','wealth','problems','.']},
{'checked': ['No', 'acute', 'distress'],
'document': ['No cute distress'],
'token': ['No', 'cute', 'distress']}]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|spellcheck_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|141.3 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- ContextSpellCheckerModel
---
layout: model
title: German asr_exp_w2v2t_vp_s946 TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: pipeline_asr_exp_w2v2t_vp_s946
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2t_vp_s946` is a German model originally trained by jonatasgrosman.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2t_vp_s946_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_vp_s946_de_4.2.0_3.0_1664110443438.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_vp_s946_de_4.2.0_3.0_1664110443438.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2t_vp_s946', lang = 'de')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2t_vp_s946", lang = "de")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_exp_w2v2t_vp_s946|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|de|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Detect Living Species
author: John Snow Labs
name: bert_token_classifier_ner_living_species
date: 2022-06-27
tags: [es, ner, clinical, licensed, bertfortokenclassification]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Extract living species from clinical texts in Spanish, a task critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition, and agriculture. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP.
It is trained on the [LivingNER](https://temu.bsc.es/livingner/) corpus, which is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others.
## Predicted Entities
`HUMAN`, `SPECIES`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_es_3.5.3_3.0_1656316616890.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_es_3.5.3_3.0_1656316616890.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")\
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_living_species", "es", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("ner")\
.setCaseSensitive(True)\
.setMaxSentenceLength(512)
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
ner_model,
ner_converter
])
data = spark.createDataFrame([["""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_living_species", "es", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
ner_model,
ner_converter))
val data = Seq("""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.classify.bert_token.ner_living_species").predict("""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""")
```
## Results
```bash
+--------------+-------+
|ner_chunk |label |
+--------------+-------+
|Lactante varón|HUMAN |
|familiares |HUMAN |
|personales |HUMAN |
|neonatal |HUMAN |
|legumbres |SPECIES|
|lentejas |SPECIES|
|garbanzos |SPECIES|
|legumbres |SPECIES|
|madre |HUMAN |
|Cacahuete |SPECIES|
|padres |HUMAN |
+--------------+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_living_species|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|410.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
[https://temu.bsc.es/livingner/](https://temu.bsc.es/livingner/)
## Benchmarking
```bash
label precision recall f1-score support
B-HUMAN 0.96 0.99 0.97 3281
B-SPECIES 0.89 0.94 0.91 3712
I-HUMAN 0.86 0.75 0.80 297
I-SPECIES 0.88 0.90 0.89 1732
micro-avg 0.91 0.94 0.93 9022
macro-avg 0.90 0.89 0.89 9022
weighted-avg 0.91 0.94 0.93 9022
```
---
layout: model
title: Legal Management Clause Binary Classifier
author: John Snow Labs
name: legclf_management_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `management` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `management`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_management_clause_en_1.0.0_3.2_1660123720911.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_management_clause_en_1.0.0_3.2_1660123720911.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+------------+
|      result|
+------------+
|[management]|
|     [other]|
|     [other]|
|[management]|
+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_management_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
management 0.96 0.90 0.93 71
other 0.95 0.98 0.96 136
accuracy - - 0.95 207
macro-avg 0.95 0.94 0.95 207
weighted-avg 0.95 0.95 0.95 207
```
---
layout: model
title: Relation Extraction between Tumors and Sizes
author: John Snow Labs
name: re_oncology_size_wip
date: 2022-09-26
tags: [licensed, clinical, oncology, relation_extraction, en]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RelationExtractionModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This relation extraction model links Tumor_Size extractions to their corresponding Tumor_Finding extractions.
## Predicted Entities
`is_size_of`, `O`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_oncology_size_wip_en_4.0.0_3.0_1664230171831.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_oncology_size_wip_en_4.0.0_3.0_1664230171831.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(['document'])\
.setOutputCol('sentence')
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", 'token']) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos_tags")
dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \
.setInputCols(["sentence", "pos_tags", "token"]) \
.setOutputCol("dependencies")
re_model = RelationExtractionModel.pretrained("re_oncology_size_wip", "en", "clinical/models") \
.setInputCols(["embeddings", "pos_tags", "ner_chunk", "dependencies"]) \
.setOutputCol("relation_extraction") \
.setRelationPairs(["Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding"]) \
.setMaxSyntacticDistance(10)
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
pos_tagger,
dependency_parser,
re_model])
data = spark.createDataFrame([["The patient presented a 2 cm mass in her left breast, and the tumor in her other breast was 3 cm long."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos_tags")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentence", "pos_tags", "token"))
.setOutputCol("dependencies")
val re_model = RelationExtractionModel.pretrained("re_oncology_size_wip", "en", "clinical/models")
.setInputCols(Array("embeddings", "pos_tags", "ner_chunk", "dependencies"))
.setOutputCol("relation_extraction")
.setRelationPairs(Array("Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding"))
.setMaxSyntacticDistance(10)
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
pos_tagger,
dependency_parser,
re_model))
val data = Seq("The patient presented a 2 cm mass in her left breast, and the tumor in her other breast was 3 cm long.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.oncology.size_wip").predict("""The patient presented a 2 cm mass in her left breast, and the tumor in her other breast was 3 cm long.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_exper6_mesum5", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_exper6_mesum5", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_exper6_mesum5|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|322.3 MB|
---
layout: model
title: English Named Entity Recognition (from CouchCat)
author: John Snow Labs
name: distilbert_ner_ma_ner_v7_distil
date: 2022-05-16
tags: [distilbert, ner, token_classification, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ma_ner_v7_distil` is an English model originally trained by `CouchCat`.
## Predicted Entities
`MATR`, `PERS`, `TIME`, `MISC`, `PAD`, `PROD`, `BRND`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_ma_ner_v7_distil_en_3.4.2_3.0_1652721967576.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_ma_ner_v7_distil_en_3.4.2_3.0_1652721967576.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_ma_ner_v7_distil","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_ma_ner_v7_distil","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_ner_ma_ner_v7_distil|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/CouchCat/ma_ner_v7_distil
---
layout: model
title: English BertForQuestionAnswering model (from aodiniz)
author: John Snow Labs
name: bert_qa_bert_uncased_L_4_H_256_A_4_squad2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-4_H-256_A-4_squad2` is an English model originally trained by `aodiniz`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_256_A_4_squad2_en_4.0.0_3.0_1654185268072.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_256_A_4_squad2_en_4.0.0_3.0_1654185268072.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_4_H_256_A_4_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_uncased_L_4_H_256_A_4_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.bert.uncased_4l_256d_a4a_256d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_uncased_L_4_H_256_A_4_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|42.1 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/aodiniz/bert_uncased_L-4_H-256_A-4_squad2
---
layout: model
title: English BertForQuestionAnswering model (from AnonymousSub)
author: John Snow Labs
name: bert_qa_bert_FT_newsqa
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_FT_newsqa` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_FT_newsqa_en_4.0.0_3.0_1654185068956.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_FT_newsqa_en_4.0.0_3.0_1654185068956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_FT_newsqa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_FT_newsqa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.news.bert.ft.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_FT_newsqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/bert_FT_newsqa
---
layout: model
title: English image_classifier_vit_new_york_tokyo_london ViTForImageClassification from Suzana
author: John Snow Labs
name: image_classifier_vit_new_york_tokyo_london
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_new_york_tokyo_london` is an English model originally trained by Suzana.
## Predicted Entities
`London`, `New York`, `Tokyo`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_new_york_tokyo_london_en_4.1.0_3.0_1660171162315.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_new_york_tokyo_london_en_4.1.0_3.0_1660171162315.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_new_york_tokyo_london", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_new_york_tokyo_london", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_new_york_tokyo_london|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: French Legal Roberta Embeddings
author: John Snow Labs
name: roberta_base_french_legal
date: 2023-02-16
tags: [fr, french, embeddings, transformer, open_source, legal, tensorflow]
task: Embeddings
language: fr
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-french-roberta-base` is a French model originally trained by `joelito`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_french_legal_fr_4.2.4_3.0_1676580048854.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_french_legal_fr_4.2.4_3.0_1676580048854.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
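This card does not ship a usage snippet; the following is a minimal sketch in the same style as the other cards on this site, assuming the standard Spark NLP `DocumentAssembler`/`Tokenizer` preprocessing stages (the sample sentence and column names are illustrative).

```python
# Sketch: embed French legal text with this RoBERTa model in a Spark NLP pipeline.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_base_french_legal", "fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Le contrat est résilié de plein droit."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```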
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_base_french_legal|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|416.2 MB|
|Case sensitive:|true|
## References
https://huggingface.co/joelito/legal-french-roberta-base
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from tiennvcs)
author: John Snow Labs
name: distilbert_qa_base_uncased_finetuned_infov
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-infovqa` is an English model originally trained by `tiennvcs`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_infov_en_4.3.0_3.0_1672768056130.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_infov_en_4.3.0_3.0_1672768056130.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_infov","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(False)  # the underlying model is uncased
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_infov","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(false) // the underlying model is uncased
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_finetuned_infov|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/tiennvcs/distilbert-base-uncased-finetuned-infovqa
---
layout: model
title: English ElectraForQuestionAnswering model (from ptran74) Version-2
author: John Snow Labs
name: electra_qa_DSPFirst_Finetuning_2
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `DSPFirst-Finetuning-2` is an English model originally trained by `ptran74`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_DSPFirst_Finetuning_2_en_4.0.0_3.0_1655919435626.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_DSPFirst_Finetuning_2_en_4.0.0_3.0_1655919435626.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_DSPFirst_Finetuning_2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_DSPFirst_Finetuning_2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.electra.finetuning_2").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_DSPFirst_Finetuning_2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ptran74/DSPFirst-Finetuning-2
- https://github.gatech.edu/pages/VIP-ITS/textbook_SQuAD_explore/explore/textbookv1.0/textbook/
---
layout: model
title: OCR small for handwritten text
author: John Snow Labs
name: ocr_small_handwritten
date: 2022-02-17
tags: [en, licensed]
task: OCR Text Detection & Recognition
language: en
nav_key: models
edition: Visual NLP 3.3.3
spark_version: 2.4
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
OCR small model for recognizing handwritten text, based on the TrOCR architecture.
The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition (OCR).
The abstract from the paper is the following: Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/ocr_small_handwritten_en_3.3.3_2.4_1645080334390.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/ocr_small_handwritten_en_3.3.3_2.4_1645080334390.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
ocr = ImageToTextv2().pretrained("ocr_small_handwritten", "en", "clinical/ocr")
ocr.setInputCols(["image"])
ocr.setOutputCol("text")
result = ocr.transform(image_text_lines_df).collect()
print(result[0].text)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ocr_small_handwritten|
|Type:|ocr|
|Compatibility:|Visual NLP 3.3.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|146.7 MB|
---
layout: model
title: Fast Neural Machine Translation Model from Afro-Asiatic languages to English
author: John Snow Labs
name: opus_mt_afa_en
date: 2021-06-01
tags: [open_source, seq2seq, translation, afa, en, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
source languages: afa
target languages: en
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_afa_en_xx_3.1.0_2.4_1622562470297.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_afa_en_xx_3.1.0_2.4_1622562470297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_afa_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_afa_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Afro-Asiatic languages.translate_to.English').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_afa_en|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Sentence Entity Resolver for ICD-10-CM (general 3 character codes - augmented)
author: John Snow Labs
name: sbiobertresolve_icd10cm_generalised_augmented
date: 2023-05-31
tags: [licensed, en, clinical, entity_resolution, icd10cm]
task: Entity Resolution
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to ICD-10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It predicts ICD-10-CM codes truncated to 3 characters (in the ICD-10-CM code structure, the first three characters identify the general category of the injury or disease).
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_generalised_augmented_en_4.4.2_3.0_1685508789416.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_generalised_augmented_en_4.4.2_3.0_1685508789416.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(['PROBLEM'])
chunk2doc = Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_generalised_augmented","en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
chunk2doc,
sbert_embedder,
icd10_resolver])
data = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
.setWhiteList("PROBLEM")
val chunk2doc = new Chunk2Doc()
.setInputCols("ner_chunk")
.setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols("ner_chunk_doc")
.setOutputCol("sbert_embeddings")
val icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_generalised_augmented","en", "clinical/models")
.setInputCols("sbert_embeddings")
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
chunk2doc,
sbert_embedder,
icd10_resolver))
val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+-------------------------------------+-------+----------+---------------------------------------------------------------------------+-----------------------------------------------------------------+
| ner_chunk| entity|icd10_code| resolutions| all_codes|
+-------------------------------------+-------+----------+---------------------------------------------------------------------------+-----------------------------------------------------------------+
| gestational diabetes mellitus|PROBLEM| O24|[gestational diabetes mellitus [gestational diabetes mellitus], history ...| [O24, Z86, Z87]|
|subsequent type two diabetes mellitus|PROBLEM| O24|[pre-existing type 2 diabetes mellitus [pre-existing type 2 diabetes mel...| [O24, E11, E13, Z86]|
| obesity|PROBLEM| E66|[obesity [obesity, unspecified], obese [body mass index [bmi] 40.0-44.9,...| [E66, Z68, Q13, Z86, E34, H35, Z83, Q55]|
| a body mass index|PROBLEM| Z68|[finding of body mass index [body mass index [bmi] 40.0-44.9, adult], ob...| [Z68, E66, R22, R41, M62, P29, R19, R89, M21]|
| polyuria|PROBLEM| R35|[polyuria [polyuria], polyuric state (disorder) [diabetes insipidus], he...|[R35, E23, R31, R82, N40, E72, O04, R30, R80, E88, N03, P96, N02]|
| polydipsia|PROBLEM| R63|[polydipsia [polydipsia], psychogenic polydipsia [other impulse disorder...|[R63, F63, E23, O40, G47, M79, R06, H53, I44, Q30, I45, R00, M35]|
| poor appetite|PROBLEM| R63|[poor appetite [anorexia], poor feeding [feeding problem of newborn, uns...|[R63, P92, R43, E86, R19, F52, Z72, R06, Z76, R53, R45, F50, R10]|
| vomiting|PROBLEM| R11|[vomiting [vomiting], periodic vomiting [cyclical vomiting, in migraine,...| [R11, G43, P92]|
| a respiratory tract infection|PROBLEM| J98|[respiratory tract infection [other specified respiratory disorders], up...| [J98, J06, A49, J22, J20, Z59, T17, J04, Z13, J18, P28, J39]|
+-------------------------------------+-------+----------+---------------------------------------------------------------------------+-----------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_icd10cm_generalised_augmented|
|Compatibility:|Healthcare NLP 4.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[icd10cm_code]|
|Language:|en|
|Size:|1.4 GB|
|Case sensitive:|false|
---
layout: model
title: English asr_wav2vec2_base_cynthia_tedlium_2500_v2 TFWav2Vec2ForCTC from huyue012
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_cynthia_tedlium_2500_v2
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_cynthia_tedlium_2500_v2` is an English model originally trained by huyue012.
NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_wav2vec2_base_cynthia_tedlium_2500_v2_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_cynthia_tedlium_2500_v2_en_4.2.0_3.0_1664040558353.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_cynthia_tedlium_2500_v2_en_4.2.0_3.0_1664040558353.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_cynthia_tedlium_2500_v2', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_cynthia_tedlium_2500_v2", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_cynthia_tedlium_2500_v2|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|349.2 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Portuguese asr_bp_voxforge1_xlsr TFWav2Vec2ForCTC from lgris
author: John Snow Labs
name: asr_bp_voxforge1_xlsr
date: 2022-09-26
tags: [wav2vec2, pt, audio, open_source, asr]
task: Automatic Speech Recognition
language: pt
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_bp_voxforge1_xlsr` is a Portuguese model originally trained by lgris.
NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_bp_voxforge1_xlsr_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_bp_voxforge1_xlsr_pt_4.2.0_3.0_1664193020078.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_bp_voxforge1_xlsr_pt_4.2.0_3.0_1664193020078.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_bp_voxforge1_xlsr", "pt")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_bp_voxforge1_xlsr", "pt")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_bp_voxforge1_xlsr|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|pt|
|Size:|1.2 GB|
---
layout: model
title: RxNorm Cd ChunkResolver
author: John Snow Labs
name: chunkresolve_rxnorm_cd_clinical
date: 2021-04-16
tags: [entity_resolution, clinical, licensed, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Entity resolution model based on KNN, using word embeddings and Word Mover's Distance.
## Predicted Entities
RxNorm codes and their normalized definitions, resolved with `clinical_embeddings`.
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_cd_clinical_en_3.0.0_3.0_1618603400196.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_cd_clinical_en_3.0.0_3.0_1618603400196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
rxnorm_resolver = ChunkEntityResolverModel()\
.pretrained('chunkresolve_rxnorm_cd_clinical', 'en', "clinical/models")\
.setEnableLevenshtein(True)\
.setNeighbours(200).setAlternatives(5).setDistanceWeights([3,11,0,0,0,9])\
.setInputCols(['token', 'chunk_embeddings'])\
.setOutputCol('rxnorm_resolution')\
.setPoolingStrategy("MAX")
pipeline_rxnorm = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, rxnorm_resolver])
data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation."""]]).toDF("text")
model = pipeline_rxnorm.fit(data)
results = model.transform(data)
```
```scala
...
val rxnorm_resolver = ChunkEntityResolverModel
.pretrained("chunkresolve_rxnorm_cd_clinical", "en", "clinical/models")
.setEnableLevenshtein(true)
.setNeighbours(200).setAlternatives(5).setDistanceWeights(Array(3,11,0,0,0,9))
.setInputCols(Array("token", "chunk_embeddings"))
.setOutputCol("rxnorm_resolution")
.setPoolingStrategy("MAX")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, rxnorm_resolver))
val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
| chunk| entity| target_text| code|confidence|
+---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
| metformin|TREATMENT|metFORMIN compounding powder:::Metformin Hydrochloride Powder:::metFORMIN 500 mg oral tablet:::me...| 601021| 0.2364|
| glipizide|TREATMENT|Glipizide Powder:::Glipizide Crystal:::Glipizide Tablets:::glipiZIDE 5 mg oral tablet:::glipiZIDE...| 241604| 0.3647|
|dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG|TREATMENT|Ezetimibe and Atorvastatin Tablets:::Amlodipine and Atorvastatin Tablets:::Atorvastatin Calcium T...|1422084| 0.3407|
| dapagliflozin|TREATMENT|Dapagliflozin Tablets:::dapagliflozin 5 mg oral tablet:::dapagliflozin 10 mg oral tablet:::Dapagl...|1488568| 0.7070|
+---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
```
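As the table above shows, the resolver packs its candidate resolutions into a single `:::`-delimited `target_text` string. A minimal pure-Python sketch (the helper name and the truncated example row are illustrative, not part of the model's API) for pulling out the top-k candidates:

```python
def top_alternatives(target_text: str, k: int = 3) -> list:
    """Split the ':::'-delimited resolver alternatives and keep the first k."""
    return [alt.strip() for alt in target_text.split(":::")[:k]]

row = "Dapagliflozin Tablets:::dapagliflozin 5 mg oral tablet:::dapagliflozin 10 mg oral tablet"
print(top_alternatives(row, 2))
# ['Dapagliflozin Tablets', 'dapagliflozin 5 mg oral tablet']
```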
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|chunkresolve_rxnorm_cd_clinical|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[token, chunk_embeddings]|
|Output Labels:|[rxnorm]|
|Language:|en|
---
layout: model
title: English BertForQuestionAnswering model (from ruselkomp)
author: John Snow Labs
name: bert_qa_tests_finetuned_squad_test_bert_2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tests-finetuned-squad-test-bert-2` is an English model originally trained by `ruselkomp`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_tests_finetuned_squad_test_bert_2_en_4.0.0_3.0_1654192311570.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_tests_finetuned_squad_test_bert_2_en_4.0.0_3.0_1654192311570.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tests_finetuned_squad_test_bert_2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_tests_finetuned_squad_test_bert_2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.v2.by_ruselkomp").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_tests_finetuned_squad_test_bert_2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.6 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ruselkomp/tests-finetuned-squad-test-bert-2
---
layout: model
title: English image_classifier_vit_generation_xyz ViTForImageClassification from chradden
author: John Snow Labs
name: image_classifier_vit_generation_xyz
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_generation_xyz` is an English model originally trained by chradden.
## Predicted Entities
`Generation Alpha`, `Millennials`, `Generation X`, `Generation Z`, `Baby Boomers`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_generation_xyz_en_4.1.0_3.0_1660171707791.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_generation_xyz_en_4.1.0_3.0_1660171707791.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_generation_xyz", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_generation_xyz", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_generation_xyz|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: English DistilBertForTokenClassification Cased model (from Neurona)
author: John Snow Labs
name: distilbert_token_classifier_cpener_test
date: 2023-03-03
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cpener-test` is an English model originally trained by `Neurona`.
## Predicted Entities
`cpe_version`, `cpe_product`, `cpe_vendor`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_cpener_test_en_4.3.0_3.0_1677881384855.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_cpener_test_en_4.3.0_3.0_1677881384855.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_cpener_test","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_cpener_test","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_cpener_test|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/Neurona/cpener-test
---
layout: model
title: Fast Neural Machine Translation Model from Multiple languages to English
author: John Snow Labs
name: opus_mt_mul_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, mul, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `mul`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_mul_en_xx_2.7.0_2.4_1609166361160.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_mul_en_xx_2.7.0_2.4_1609166361160.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_mul_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_mul_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.mul.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_mul_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Pipeline to Detect Problems, Tests and Treatments (ner_clinical_large)
author: John Snow Labs
name: ner_clinical_large_pipeline
date: 2023-03-15
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_clinical_large](https://nlp.johnsnowlabs.com/2021/03/31/ner_clinical_large_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_large_pipeline_en_4.3.0_3.2_1678876271920.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_large_pipeline_en_4.3.0_3.2_1678876271920.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_clinical_large_pipeline", "en", "clinical/models")
text = '''The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_clinical_large_pipeline", "en", "clinical/models")
val text = "The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.clinical_large.pipeline").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:------------------------------------------------------------|--------:|------:|:------------|-------------:|
| 0 | the G-protein-activated inwardly rectifying potassium (GIRK | 48 | 106 | TREATMENT | 0.6926 |
| 1 | the genomicorganization | 142 | 164 | TREATMENT | 0.80715 |
| 2 | a candidate gene forType II diabetes mellitus | 210 | 254 | PROBLEM | 0.754343 |
| 3 | byapproximately | 380 | 394 | TREATMENT | 0.7924 |
| 4 | single nucleotide polymorphisms | 464 | 494 | TREATMENT | 0.636967 |
| 5 | aVal366Ala substitution | 532 | 554 | PROBLEM | 0.53615 |
| 6 | an 8 base-pair | 561 | 574 | PROBLEM | 0.607733 |
| 7 | insertion/deletion | 581 | 598 | PROBLEM | 0.8692 |
| 8 | Ourexpression studies | 601 | 621 | TEST | 0.89975 |
| 9 | the transcript in various humantissues | 648 | 685 | PROBLEM | 0.83306 |
| 10 | fat andskeletal muscle | 749 | 770 | PROBLEM | 0.778133 |
| 11 | furtherstudies | 830 | 843 | PROBLEM | 0.8789 |
| 12 | the KCNJ9 protein | 864 | 880 | TREATMENT | 0.561033 |
| 13 | evaluation | 892 | 901 | TEST | 0.9981 |
| 14 | Type II diabetes | 940 | 955 | PROBLEM | 0.698967 |
| 15 | the treatment | 1025 | 1037 | TREATMENT | 0.81195 |
| 16 | breast cancer | 1042 | 1054 | PROBLEM | 0.9604 |
| 17 | the standard therapy | 1067 | 1086 | TREATMENT | 0.757767 |
| 18 | anthracyclines | 1125 | 1138 | TREATMENT | 0.9999 |
| 19 | taxanes | 1144 | 1150 | TREATMENT | 0.9999 |
```
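`fullAnnotate` returns Spark NLP annotations that carry the chunk text in `result` and the entity label in `metadata`. A minimal pure-Python sketch of flattening them into `(chunk, label)` pairs, using a hypothetical stand-in `Annotation` class in place of Spark NLP's own (the field names mirror the assumed annotation structure):

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """Hypothetical stand-in for a Spark NLP annotation: chunk text plus metadata."""
    result: str
    metadata: dict

def chunks_with_labels(annotations):
    """Pair each NER chunk with its entity label from the annotation metadata."""
    return [(a.result, a.metadata.get("entity")) for a in annotations]

ner_chunks = [
    Annotation("breast cancer", {"entity": "PROBLEM", "confidence": "0.9604"}),
    Annotation("anthracyclines", {"entity": "TREATMENT", "confidence": "0.9999"}),
]
print(chunks_with_labels(ner_chunks))
# [('breast cancer', 'PROBLEM'), ('anthracyclines', 'TREATMENT')]
```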
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_clinical_large_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18)
author: John Snow Labs
name: distilbert_qa_base_uncased_becasv2_2
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becasv2-2` is an English model originally trained by `Evelyn18`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_2_en_4.3.0_3.0_1672767690511.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_2_en_4.3.0_3.0_1672767690511.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_becasv2_2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Evelyn18/distilbert-base-uncased-becasv2-2
---
layout: model
title: English asr_autonlp_hindi_asr TFWav2Vec2ForCTC from abhishek
author: John Snow Labs
name: asr_autonlp_hindi_asr
date: 2022-09-26
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_autonlp_hindi_asr` is an English model originally trained by abhishek.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_autonlp_hindi_asr_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_autonlp_hindi_asr_en_4.2.0_3.0_1664195323489.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_autonlp_hindi_asr_en_4.2.0_3.0_1664195323489.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_autonlp_hindi_asr", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_autonlp_hindi_asr", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
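The `audioDf` referenced above must hold raw audio samples as floats in its `audio_content` column. A self-contained sketch (assuming 16 kHz mono 16-bit PCM input, the format Wav2Vec2 models typically expect) of decoding a WAV into that form with only the standard library; the helper name is illustrative:

```python
import io
import math
import struct
import wave

def wav_to_floats(wav_bytes: bytes) -> list:
    """Decode 16-bit PCM samples and normalize them to the [-1.0, 1.0] range."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a 0.1 s, 440 Hz test tone in memory so the sketch is self-contained.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    tone = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / 16000))
            for t in range(1600)]
    wf.writeframes(struct.pack("<%dh" % len(tone), *tone))

floats = wav_to_floats(buf.getvalue())
print(len(floats))  # 1600 samples for 0.1 s at 16 kHz
```

The resulting list can then be wrapped into a DataFrame, e.g. `spark.createDataFrame([[floats]], ["audio_content"])`.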
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_autonlp_hindi_asr|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: English RobertaForSequenceClassification Cased model (from jawadhussein462)
author: John Snow Labs
name: roberta_classifier_autotrain_neurips_chanllenge_1287149282
date: 2022-12-09
tags: [en, open_source, roberta, sequence_classification, classification, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
recommended: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-neurips_chanllenge-1287149282` is an English model originally trained by `jawadhussein462`.
## Predicted Entities
`1`, `0`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_autotrain_neurips_chanllenge_1287149282_en_4.2.4_3.0_1670624021899.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_autotrain_neurips_chanllenge_1287149282_en_4.2.4_3.0_1670624021899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autotrain_neurips_chanllenge_1287149282","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier])
data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autotrain_neurips_chanllenge_1287149282","en")
.setInputCols(Array("document", "token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier))
val data = Seq("I love you!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_classifier_autotrain_neurips_chanllenge_1287149282|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/jawadhussein462/autotrain-neurips_chanllenge-1287149282
---
layout: model
title: English RobertaForQuestionAnswering (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_EManuals_RoBERTa_squad2.0
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `EManuals_RoBERTa_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_EManuals_RoBERTa_squad2.0_en_4.0.0_3.0_1655726695679.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_EManuals_RoBERTa_squad2.0_en_4.0.0_3.0_1655726695679.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_EManuals_RoBERTa_squad2.0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_EManuals_RoBERTa_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.emanuals.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_EManuals_RoBERTa_squad2.0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|466.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/EManuals_RoBERTa_squad2.0
---
layout: model
title: German asr_wav2vec2_large_xlsr_53_german_by_facebook TFWav2Vec2ForCTC from facebook
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_53_german_by_facebook
date: 2022-09-24
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_german_by_facebook` is a German model originally trained by facebook.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_german_by_facebook_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_by_facebook_de_4.2.0_3.0_1664026408621.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_by_facebook_de_4.2.0_3.0_1664026408621.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xlsr_53_german_by_facebook", "de")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xlsr_53_german_by_facebook", "de")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xlsr_53_german_by_facebook|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|de|
|Size:|332.6 MB|
---
layout: model
title: Detect Radiology Concepts - WIP (biobert)
author: John Snow Labs
name: jsl_rd_ner_wip_greedy_biobert
date: 2021-07-26
tags: [licensed, clinical, en, ner]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.1.3
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Extract clinical entities from radiology reports using a pretrained NER model.
## Predicted Entities
`Test_Result`, `OtherFindings`, `BodyPart`, `ImagingFindings`, `Disease_Syndrome_Disorder`, `ImagingTest`, `Measurements`, `Procedure`, `Score`, `Test`, `Medical_Device`, `Direction`, `Symptom`, `Imaging_Technique`, `ManualFix`, `Units`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_rd_ner_wip_greedy_biobert_en_3.1.3_3.0_1627305153541.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_rd_ner_wip_greedy_biobert_en_3.1.3_3.0_1627305153541.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings_clinical = BertEmbeddings.pretrained('biobert_pubmed_base_cased') \
.setInputCols(['sentence', 'token']) \
.setOutputCol('embeddings')
clinical_ner = MedicalNerModel.pretrained("jsl_rd_ner_wip_greedy_biobert", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma."]], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("jsl_rd_ner_wip_greedy_biobert", "en", "clinical/models")
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))
val data = Seq("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.radiology.wip_greedy_biobert").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""")
```
## Results
```bash
| | chunk | entity |
|---:|:----------------------|:--------------------------|
| 0 | Bilateral | Direction |
| 1 | breast | BodyPart |
| 2 | ultrasound | ImagingTest |
| 3 | ovoid mass | ImagingFindings |
| 4 | 0.5 x 0.5 x 0.4 | Measurements |
| 5 | cm | Units |
| 6 | left | Direction |
| 7 | shoulder | BodyPart |
| 8 | mass | ImagingFindings |
| 9 | isoechoic echotexture | ImagingFindings |
| 10 | muscle | BodyPart |
| 11 | internal color flow | ImagingFindings |
| 12 | benign fibrous tissue | ImagingFindings |
| 13 | lipoma | Disease_Syndrome_Disorder |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|jsl_rd_ner_wip_greedy_biobert|
|Compatibility:|Healthcare NLP 3.1.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Data Source
Trained on a dataset annotated by John Snow Labs.
## Benchmarking
```bash
label tp fp fn prec rec f1
B-Units 253 7 11 0.9730769 0.9583333 0.9656488
B-Medical_Device 382 109 74 0.7780040 0.8377193 0.8067581
B-BodyPart 2645 347 276 0.8840241 0.9055118 0.8946389
I-BodyPart 645 142 135 0.819568 0.8269231 0.8232291
B-Imaging_Technique 137 36 33 0.7919075 0.8058823 0.7988338
B-Procedure 260 93 130 0.7365439 0.6666667 0.6998653
B-Direction 1573 136 123 0.9204213 0.9274764 0.9239353
I-ImagingTest 30 9 32 0.7692308 0.4838709 0.5940594
I-Test_Result 2 0 0 1 1 1
B-Measurements 452 24 30 0.9495798 0.9377593 0.9436326
B-ImagingFindings 1929 679 542 0.7396472 0.7806556 0.7595984
B-Test 146 17 49 0.8957055 0.7487179 0.8156425
B-ManualFix 2 0 2 1 0.5 0.6666667
I-Procedure 147 91 106 0.6176470 0.5810277 0.598778
I-Imaging_Technique 75 63 26 0.5434782 0.7425743 0.6276151
I-Measurements 45 3 6 0.9375 0.8823529 0.9090909
B-ImagingTest 328 36 85 0.9010989 0.7941888 0.8442728
I-Test 26 9 34 0.7428571 0.4333333 0.5473684
I-Symptom 138 62 142 0.69 0.4928571 0.575
I-ImagingFindings 1348 617 662 0.6860051 0.6706468 0.678239
B-Disease_Syndrome_Disorder 1068 298 243 0.7818448 0.8146453 0.7979080
B-Symptom 523 110 190 0.8262243 0.7335203 0.7771174
I-Disease_Syndrome_Disorder 377 168 171 0.6917431 0.6879562 0.6898445
I-Medical_Device 369 72 62 0.8367347 0.8561485 0.8463302
I-Direction 352 38 41 0.9025641 0.8956743 0.899106
Macro-average 13272 3200 3313 0.7195612 0.6489194 0.6824170
Micro-average 13272 3200 3313 0.8057309 0.8002412 0.8029767
```
---
layout: model
title: Legal Power of attorney Clause Binary Classifier (md)
author: John Snow Labs
name: legclf_power_of_attorney_md
date: 2022-11-25
tags: [en, legal, classification, document, agreement, contract, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `power-of-attorney` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level.
If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the same tutorial linked above).
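As a rough, plain-Python illustration of that splitting idea (the whitespace tokenizer and the fixed 512-token window are simplifications; a real subword tokenizer will count tokens differently):

```python
def split_into_chunks(text, max_tokens=512):
    """Greedily pack whitespace-separated tokens into chunks of at most max_tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

# A toy document longer than the 512-token limit (1200 tokens).
doc = ("clause " * 1200).strip()
chunks = split_into_chunks(doc)
print(len(chunks))             # 3 chunks: 512 + 512 + 176 tokens
print(len(chunks[0].split()))  # 512
```

Each chunk can then be fed to the classifier separately, and the per-chunk True/False results aggregated as needed.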
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `power-of-attorney`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_power_of_attorney_md_en_1.0.0_3.0_1669376514743.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_power_of_attorney_md_en_1.0.0_3.0_1669376514743.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
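No code snippet is currently included on this card. The sketch below follows the pattern of sibling Legal NLP classifier cards; the embeddings stage (`sent_bert_base_cased`) and the column names are assumptions, not taken from this page, so adjust them to your environment.

```python
# Sketch only: requires a Spark NLP for Legal session and a valid license.
# The sentence-embeddings model used here is an assumption based on similar legclf cards.
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = ClassifierDLModel.pretrained("legclf_power_of_attorney_md", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```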
## Results
```bash
+-------------------+
|             result|
+-------------------+
|[power-of-attorney]|
|            [other]|
|            [other]|
|[power-of-attorney]|
+-------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_power_of_attorney_md|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents.
## Benchmarking
```bash
precision recall f1-score support
other 0.89 1.00 0.94 39
power-of-attorney 1.00 0.79 0.88 24
accuracy 0.92 63
macro avg 0.94 0.90 0.91 63
weighted avg 0.93 0.92 0.92 63
```
---
layout: model
title: ALBERT Large CoNNL-03 NER Pipeline
author: John Snow Labs
name: albert_large_token_classifier_conll03_pipeline
date: 2022-04-23
tags: [open_source, ner, token_classifier, albert, conll03, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [albert_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/albert_large_token_classifier_conll03_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_large_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650710898514.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_large_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650710898514.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("albert_large_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")
```
```scala
val pipeline = new PretrainedPipeline("albert_large_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|John |PER |
|John Snow Labs|ORG |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_large_token_classifier_conll03_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|64.4 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- AlbertForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from allenai)
author: John Snow Labs
name: t5_small_squad2_next_word_generator_squad
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-squad2-next-word-generator-squad` is an English model originally trained by `allenai`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_squad2_next_word_generator_squad_en_4.3.0_3.0_1675155704406.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_squad2_next_word_generator_squad_en_4.3.0_3.0_1675155704406.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCols("text") \
.setOutputCols("document")
t5 = T5Transformer.pretrained("t5_small_squad2_next_word_generator_squad","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCols("text")
.setOutputCols("document")
val t5 = T5Transformer.pretrained("t5_small_squad2_next_word_generator_squad","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_small_squad2_next_word_generator_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|148.1 MB|
## References
- https://huggingface.co/allenai/t5-small-squad2-next-word-generator-squad
---
layout: model
title: Part of Speech for Japanese
author: John Snow Labs
name: pos_ud_gsd
date: 2021-03-09
tags: [part_of_speech, open_source, japanese, pos_ud_gsd, ja]
task: Part of Speech Tagging
language: ja
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`.
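As a toy illustration of the averaged perceptron idea (plain Python, not the actual Spark NLP implementation): on each mistake the weights are nudged toward the gold label, and the final model averages the weight vectors seen during training, which stabilizes the predictions.

```python
# Toy averaged perceptron for binary classification over sparse features.
def train_averaged_perceptron(data, epochs=5):
    w = {}        # current weights: feature -> float
    totals = {}   # accumulated weights, for averaging
    steps = 0
    for _ in range(epochs):
        for features, label in data:      # label is +1 or -1
            score = sum(w.get(f, 0.0) for f in features)
            if label * score <= 0:        # mistake: nudge weights toward label
                for f in features:
                    w[f] = w.get(f, 0.0) + label
            for f, v in w.items():        # accumulate for the final average
                totals[f] = totals.get(f, 0.0) + v
            steps += 1
    return {f: v / steps for f, v in totals.items()}

# Hypothetical POS-style features: suffix of the word, tag of the previous word.
data = [({"suffix=ed", "prev=NOUN"}, +1),
        ({"suffix=ly", "prev=VERB"}, -1)]
avg_w = train_averaged_perceptron(data)
print(avg_w["suffix=ed"] > 0)  # True
```

A real POS tagger extends this to multiclass prediction with one weight vector per tag, but the update-and-average loop is the same.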
## Predicted Entities
- NOUN
- ADP
- VERB
- SCONJ
- AUX
- PUNCT
- PART
- DET
- NUM
- ADV
- PRON
- ADJ
- PROPN
- CCONJ
- SYM
- INTJ
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_ja_3.0.0_3.0_1615292368738.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_ja_3.0.0_3.0_1615292368738.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
pos = PerceptronModel.pretrained("pos_ud_gsd", "ja") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
posTagger
])
example = spark.createDataFrame([['ジョンスノーラボからこんにちは! ']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_gsd", "ja")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("ジョンスノーラボからこんにちは! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = [""ジョンスノーラボからこんにちは! ""]
token_df = nlu.load('ja.pos.ud_gsd').predict(text)
token_df
```
## Results
```bash
token pos
0 ジョンス NOUN
1 ノ NOUN
2 ー NOUN
3 ラ NOUN
4 ボ NOUN
5 から ADP
6 こん NOUN
7 に ADP
8 ち NOUN
9 は ADP
10 ! VERB
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_gsd|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|ja|
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from Moussab)
author: John Snow Labs
name: roberta_qa_deepset_base_squad2_orkg_how_5e_05
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deepset-roberta-base-squad2-orkg-how-5e-05` is an English model originally trained by `Moussab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_how_5e_05_en_4.3.0_3.0_1674209532316.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_how_5e_05_en_4.3.0_3.0_1674209532316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_how_5e_05","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_how_5e_05","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_deepset_base_squad2_orkg_how_5e_05|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|464.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Moussab/deepset-roberta-base-squad2-orkg-how-5e-05
---
layout: model
title: Pipeline for Adverse Drug Events
author: John Snow Labs
name: explain_clinical_doc_ade
date: 2022-06-30
tags: [en, clinical, licensed, ade, pipeline]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A pipeline for Adverse Drug Events (ADE) with `ner_ade_biobert`, `assertion_dl_biobert`, `classifierdl_ade_conversational_biobert`, and `re_ade_biobert`. It will classify the document, extract ADE and DRUG clinical entities, assign assertion status to ADE entities, and relate drugs with their ADEs.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_ade_en_4.0.0_3.0_1656581944018.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_ade_en_4.0.0_3.0_1656581944018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("explain_clinical_doc_ade", "en", "clinical/models")
res = pipeline.fullAnnotate("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val era_pipeline = new PretrainedPipeline("explain_clinical_doc_ade", "en", "clinical/models")
val result = era_pipeline.fullAnnotate("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""")(0)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.explain_doc.clinical_ade").predict("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""")
```
## Results
```bash
Class: True
NER_Assertion:
| | chunk | entity | assertion |
|----|-------------------------|------------|-------------|
| 0 | Lipitor | DRUG | - |
| 1 | severe fatigue | ADE | Conditional |
| 2 | voltaren | DRUG | - |
| 3 | cramps | ADE | Conditional |
Relations:
| | chunk1 | entity1 | chunk2 | entity2 | relation |
|----|-------------------------------|------------|-------------|---------|----------|
| 0 | severe fatigue | ADE | Lipitor | DRUG | 1 |
| 1 | cramps | ADE | Lipitor | DRUG | 0 |
| 2 | severe fatigue | ADE | voltaren | DRUG | 0 |
| 3 | cramps | ADE | voltaren | DRUG | 1 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_clinical_doc_ade|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|484.6 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- BertEmbeddings
- SentenceEmbeddings
- ClassifierDLModel
- MedicalNerModel
- NerConverterInternalModel
- PerceptronModel
- DependencyParserModel
- RelationExtractionModel
- NerConverterInternalModel
- AssertionDLModel
---
layout: model
title: English RobertaForQuestionAnswering (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739275203.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739275203.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.base_rule_based_quadruplet_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|460.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0
---
layout: model
title: Chinese BertForMaskedLM Base Cased model (from ptrsxu)
author: John Snow Labs
name: bert_embeddings_ptrsxu_base_chinese
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-chinese` is a Chinese model originally trained by `ptrsxu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_ptrsxu_base_chinese_zh_4.2.4_3.0_1670016435043.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_ptrsxu_base_chinese_zh_4.2.4_3.0_1670016435043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCols(["text"]) \
.setOutputCols("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_ptrsxu_base_chinese","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCols(Array("text"))
.setOutputCols(Array("document"))
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_ptrsxu_base_chinese","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_ptrsxu_base_chinese|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|383.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/ptrsxu/bert-base-chinese
- https://aclanthology.org/2021.acl-long.330.pdf
- https://dl.acm.org/doi/pdf/10.1145/3442188.3445922
---
layout: model
title: Fast Neural Machine Translation Model from Spanish to English
author: John Snow Labs
name: opus_mt_es_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, es, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `es`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_es_en_xx_2.7.0_2.4_1609165183882.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_es_en_xx_2.7.0_2.4_1609165183882.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_es_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("¿Cómo te llamas?")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_es_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("¿Cómo te llamas?").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.es.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
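A LightPipeline's `fullAnnotate` returns one result object per input text. The sketch below post-processes a simplified, hypothetical version of that structure (plain lists of strings rather than Annotation objects) to collect all sentence translations:

```python
# Hypothetical fullAnnotate-style output: one dict per input text, with the
# "translation" key holding one entry per detected sentence.
annotations = [
    {"translation": ["Hello, how are you?", "I live in Berkeley."]},
]

def collect_translations(results, column="translation"):
    """Flatten the per-sentence translations of every annotated document."""
    return [sentence for doc in results for sentence in doc.get(column, [])]

print(collect_translations(annotations))
```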
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_es_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForQuestionAnswering Uncased model (from roshnir)
author: John Snow Labs
name: bert_qa_multi_uncased_trained_squadv2
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-multi-uncased-trained-squadv2` is an English model originally trained by `roshnir`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_multi_uncased_trained_squadv2_en_4.0.0_3.0_1657187851046.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_multi_uncased_trained_squadv2_en_4.0.0_3.0_1657187851046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_multi_uncased_trained_squadv2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_multi_uncased_trained_squadv2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq("What is my name?", "My name is Clara and I live in Berkeley.").toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
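After `transform`, the predicted span lives in the `answer` column. A self-contained sketch of pulling the top answer out of collected rows (the row below is illustrative, not actual model output):

```python
# Hypothetical rows as they might look after collecting the "question" and
# "answer" columns of the result DataFrame -- illustrative only.
rows = [
    ("What is my name?", ["Clara"]),
]

def first_answer(row):
    """Return the top answer string, or None if the model produced none."""
    _, answers = row
    return answers[0] if answers else None

print(first_answer(rows[0]))
```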
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_multi_uncased_trained_squadv2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|626.2 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/roshnir/bert-multi-uncased-trained-squadv2
- https://aclanthology.org/2020.acl-main.421.pdf
---
layout: model
title: Tagalog Electra Embeddings (from jcblaise)
author: John Snow Labs
name: electra_embeddings_electra_tagalog_small_uncased_generator
date: 2022-05-17
tags: [tl, open_source, electra, embeddings]
task: Embeddings
language: tl
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-tagalog-small-uncased-generator` is a Tagalog model originally trained by `jcblaise`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_tagalog_small_uncased_generator_tl_3.4.4_3.0_1652786769151.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_tagalog_small_uncased_generator_tl_3.4.4_3.0_1652786769151.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_tagalog_small_uncased_generator","tl") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Mahilig ako sa Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_tagalog_small_uncased_generator","tl")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Mahilig ako sa Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_embeddings_electra_tagalog_small_uncased_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|tl|
|Size:|18.9 MB|
|Case sensitive:|false|
## References
- https://huggingface.co/jcblaise/electra-tagalog-small-uncased-generator
- https://blaisecruz.com
---
layout: model
title: English asr_wav2vec2_base_100h_by_facebook TFWav2Vec2ForCTC from facebook
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_100h_by_facebook
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_by_facebook` is an English model originally trained by facebook.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_100h_by_facebook_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_by_facebook_en_4.2.0_3.0_1664038682928.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_by_facebook_en_4.2.0_3.0_1664038682928.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_100h_by_facebook', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_100h_by_facebook", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_100h_by_facebook|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|227.9 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18)
author: John Snow Labs
name: distilbert_qa_base_uncased_becas_2
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becas-2` is an English model originally trained by `Evelyn18`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_2_en_4.3.0_3.0_1672767456835.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_2_en_4.3.0_3.0_1672767456835.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_becas_2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|243.9 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Evelyn18/distilbert-base-uncased-becas-2
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_64_finetuned_squad_seed_10
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-64-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_64_finetuned_squad_seed_10_en_4.3.0_3.0_1674215893984.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_64_finetuned_squad_seed_10_en_4.3.0_3.0_1674215893984.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_64_finetuned_squad_seed_10","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_64_finetuned_squad_seed_10","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_64_finetuned_squad_seed_10|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|419.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-64-finetuned-squad-seed-10
---
layout: model
title: Named Entity Recognition Profiling (Clinical)
author: John Snow Labs
name: ner_profiling_clinical
date: 2022-01-18
tags: [ner, ner_profiling, clinical, en, licensed]
task: [Named Entity Recognition, Pipeline Healthcare]
language: en
nav_key: models
edition: Healthcare NLP 3.3.1
spark_version: 2.4
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline can be used to explore all the available pretrained NER models at once. When you run this pipeline over your text, you will end up with the predictions coming out of each pretrained clinical NER model trained with `embeddings_clinical`. Compared to the previous version, it has been updated with new clinical NER models and their outputs.
Here are the NER models that this pretrained pipeline includes: `ner_ade_clinical`, `ner_posology_greedy`, `ner_risk_factors`, `jsl_ner_wip_clinical`, `ner_human_phenotype_gene_clinical`, `jsl_ner_wip_greedy_clinical`, `ner_cellular`, `ner_cancer_genetics`, `jsl_ner_wip_modifier_clinical`, `ner_drugs_greedy`, `ner_deid_sd_large`, `ner_diseases`, `nerdl_tumour_demo`, `ner_deid_subentity_augmented`, `ner_jsl_enriched`, `ner_genetic_variants`, `ner_bionlp`, `ner_measurements_clinical`, `ner_diseases_large`, `ner_radiology`, `ner_deid_augmented`, `ner_anatomy`, `ner_chemprot_clinical`, `ner_posology_experimental`, `ner_drugs`, `ner_deid_sd`, `ner_posology_large`, `ner_deid_large`, `ner_posology`, `ner_deidentify_dl`, `ner_deid_enriched`, `ner_bacterial_species`, `ner_drugs_large`, `ner_clinical_large`, `jsl_rd_ner_wip_greedy_clinical`, `ner_medmentions_coarse`, `ner_radiology_wip_clinical`, `ner_clinical`, `ner_chemicals`, `ner_deid_synthetic`, `ner_events_clinical`, `ner_posology_small`, `ner_anatomy_coarse`, `ner_human_phenotype_go_clinical`, `ner_jsl_slim`, `ner_jsl`, `ner_jsl_greedy`, `ner_events_admission_clinical`, `ner_chexpert` .
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.2.Pretrained_NER_Profiling_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_profiling_clinical_en_3.3.1_2.4_1642496753293.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_profiling_clinical_en_3.3.1_2.4_1642496753293.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
ner_profiling_pipeline = PretrainedPipeline('ner_profiling_clinical', 'en', 'clinical/models')
result = ner_profiling_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_clinical", "en", "clinical/models")
val result = ner_profiling_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.profiling_clinical").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_talbanken", "sv")\
.setInputCols(["document", "token"])\
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([["' Medicinsk bildtolk ' också skall fungera som hjälpmedel för läkaren att klarlägga sjukdomsbilden utan att patienten behöver säga ett ord ."]], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_talbanken", "sv")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector,tokenizer , pos))
val data = Seq(" Medicinsk bildtolk " också skall fungera som hjälpmedel för läkaren att klarlägga sjukdomsbilden utan att patienten behöver säga ett ord .").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = [""' Medicinsk bildtolk ' också skall fungera som hjälpmedel för läkaren att klarlägga sjukdomsbilden utan att patienten behöver säga ett ord .""]
token_df = nlu.load('sv.pos.talbanken').predict(text)
token_df
```
## Results
```bash
+---------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
|text |result |
+---------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
|' Medicinsk bildtolk ' också skall fungera som hjälpmedel för läkaren att klarlägga sjukdomsbilden utan att patienten behöver säga ett ord . |[PUNCT, ADJ, NOUN, PUNCT, ADV, AUX, VERB, SCONJ, NOUN, ADP, NOUN, PART, VERB, NOUN, ADP, SCONJ, NOUN, AUX, VERB, DET, NOUN, PUNCT]|
+---------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
```
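The result table above pairs the input text with one POS tag per token. A minimal sketch of aligning tokens with their predicted tags once both lists have been collected (the lists below are truncated from the example output):

```python
# Tokens and predicted tags, truncated from the example output above.
tokens = ["'", "Medicinsk", "bildtolk", "'", "också"]
tags = ["PUNCT", "ADJ", "NOUN", "PUNCT", "ADV"]

def align(tokens, tags):
    """Pair each token with its POS tag; lengths must match."""
    if len(tokens) != len(tags):
        raise ValueError("token/tag count mismatch")
    return list(zip(tokens, tags))

for token, tag in align(tokens, tags):
    print(f"{token}\t{tag}")
```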
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_talbanken|
|Compatibility:|Spark NLP 2.7.5+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[pos]|
|Language:|sv|
## Data Source
The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) data set.
## Benchmarking
```bash
| | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| ADJ | 0.88 | 0.89 | 0.89 | 1826 |
| ADP | 0.96 | 0.96 | 0.96 | 2298 |
| ADV | 0.91 | 0.87 | 0.89 | 1528 |
| AUX | 0.91 | 0.93 | 0.92 | 1021 |
| CCONJ | 0.95 | 0.94 | 0.94 | 791 |
| DET | 0.92 | 0.95 | 0.93 | 1015 |
| INTJ | 1.00 | 0.33 | 0.50 | 3 |
| NOUN | 0.94 | 0.95 | 0.95 | 4711 |
| NUM | 0.98 | 0.96 | 0.97 | 357 |
| PART | 0.93 | 0.94 | 0.94 | 406 |
| PRON | 0.94 | 0.91 | 0.92 | 1449 |
| PROPN | 0.88 | 0.83 | 0.85 | 243 |
| PUNCT | 0.97 | 0.98 | 0.98 | 2104 |
| SCONJ | 0.86 | 0.82 | 0.84 | 491 |
| SYM | 0.50 | 1.00 | 0.67 | 1 |
| VERB | 0.90 | 0.90 | 0.90 | 2142 |
| accuracy | | | 0.93 | 20386 |
| macro avg | 0.90 | 0.89 | 0.88 | 20386 |
| weighted avg | 0.93 | 0.93 | 0.93 | 20386 |
```
---
layout: model
title: Detect Adverse Drug Events (BertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_ner_ade
date: 2021-09-30
tags: [adverse, ade, bertfortokenclassification, ner, en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.2.2
spark_version: 2.4
supported: true
annotator: MedicalBertForTokenClassifier
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Detect adverse reactions of drugs in reviews, tweets, and medical text using the pretrained NER model. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP.
## Predicted Entities
`DRUG`, `ADE`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_en_3.2.2_2.4_1633008677011.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_en_3.2.2_2.4_1633008677011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_ade", "en", "clinical/models")\
.setInputCols("token", "document")\
.setOutputCol("ner")\
.setCaseSensitive(True)
ner_converter = NerConverter()\
.setInputCols(["document","token","ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter])
p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))
test_sentence = """Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps"""
result = p_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]})))
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_ade", "en", "clinical/models")
.setInputCols(Array("token", "document"))
.setOutputCol("ner")
.setCaseSensitive(true)
val ner_converter = new NerConverter()
.setInputCols(Array("document","token","ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter))
val data = Seq("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.ner_ade").predict("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|Lipitor |DRUG |
|severe fatigue|ADE |
|voltaren |DRUG |
|cramps |ADE |
+--------------+---------+
```
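Once the chunk/label pairs above are collected off the result DataFrame, filtering by entity type is straightforward; the sketch below uses the example output as hard-coded data:

```python
# (chunk, label) pairs mirroring the example output above.
entities = [
    ("Lipitor", "DRUG"),
    ("severe fatigue", "ADE"),
    ("voltaren", "DRUG"),
    ("cramps", "ADE"),
]

def by_label(pairs, label):
    """Return all chunks tagged with the given NER label."""
    return [chunk for chunk, tag in pairs if tag == label]

print(by_label(entities, "ADE"))
```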
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_ade|
|Compatibility:|Healthcare NLP 3.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Case sensitive:|true|
|Max sentence length:|512|
## Data Source
This model is trained on a custom dataset by John Snow Labs.
## Benchmarking
```bash
label precision recall f1-score support
B-ADE 0.93 0.79 0.85 2694
B-DRUG 0.97 0.87 0.92 9539
I-ADE 0.93 0.73 0.82 3236
I-DRUG 0.95 0.82 0.88 6115
accuracy - - 0.83 21584
macro-avg 0.84 0.84 0.84 21584
weighted-avg 0.95 0.83 0.89 21584
```
---
layout: model
title: Financial Assertion Status (Negation)
author: John Snow Labs
name: finassertion_negation
date: 2023-01-01
tags: [negation, en, licensed]
task: Assertion Status
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a Financial Negation model, aimed at identifying whether an NER entity is negated in its context.
## Predicted Entities
`positive`, `negative`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finassertion_negation_en_1.0.0_3.0_1672578587267.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finassertion_negation_en_1.0.0_3.0_1672578587267.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
import pyspark.sql.functions as F
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
finassertion = finance.AssertionDLModel.pretrained("finassertion_negation", "en", "finance/models")\
.setInputCols(["sentence", "ner_chunk", "embeddings"])\
.setOutputCol("finlabel")
pipe = nlp.Pipeline(stages = [ document_assembler, sentence_detector, tokenizer, embeddings, ner, ner_converter, finassertion])
text = "Gradio INC will not be entering into a joint agreement with Hugging Face, Inc."
sdf = spark.createDataFrame([[text]]).toDF("text")
res = pipe.fit(sdf).transform(sdf)
res.select(F.explode(F.arrays_zip(res.ner_chunk.result,
res.finlabel.result)).alias("cols"))\
.select(F.expr("cols['0']").alias("ner_chunk"),
F.expr("cols['1']").alias("assertion")).show(200, truncate=100)
```
## Results
```bash
+-----------------+---------+
| ner_chunk|assertion|
+-----------------+---------+
| Gradio INC| negative|
|Hugging Face, Inc| positive|
+-----------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finassertion_negation|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, chunk, embeddings]|
|Output Labels:|[assertion]|
|Language:|en|
|Size:|2.2 MB|
## References
In-house annotated legal sentences
## Benchmarking
```bash
label tp fp fn prec rec f1
negative 26 0 1 1.0 0.962963 0.9811321
positive 38 1 0 0.974359 1.0 0.987013
Macro-average 64 1 1 0.9871795 0.9814815 0.9843222
Micro-average 0.9846154 0.9846154 0.9846154
```
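As a sanity check, the averaged rows above can be reproduced from the per-label counts in plain Python. Note that the macro F1 matches the table only when taken as the harmonic mean of the macro precision and recall, rather than as the mean of the per-label F1 scores:

```python
# Recompute the averaged metrics from the per-label tp / fp / fn counts above.
tp = {"negative": 26, "positive": 38}
fp = {"negative": 0, "positive": 1}
fn = {"negative": 1, "positive": 0}

prec = {l: tp[l] / (tp[l] + fp[l]) for l in tp}
rec = {l: tp[l] / (tp[l] + fn[l]) for l in tp}

# Micro-average: pool all counts, then compute each metric once.
TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
micro_p = TP / (TP + FP)
micro_r = TP / (TP + FN)
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)

# Macro-average: mean of the per-label precision/recall, with F1 taken as
# their harmonic mean (this reproduces the table above).
macro_p = sum(prec.values()) / len(prec)
macro_r = sum(rec.values()) / len(rec)
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

print(round(micro_f1, 7), round(macro_p, 7), round(macro_r, 7), round(macro_f1, 7))
# 0.9846154 0.9871795 0.9814815 0.9843222
```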
---
layout: model
title: English BertForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_0
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-256-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_0_en_4.0.0_3.0_1657192274652.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_0_en_4.0.0_3.0_1657192274652.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|384.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-256-finetuned-squad-seed-0
---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_2
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-256-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_2_en_4.0.0_3.0_1655732129289.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_2_en_4.0.0_3.0_1655732129289.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_256d_seed_2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|427.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-2
---
layout: model
title: Legal Services Clause Binary Classifier
author: John Snow Labs
name: legclf_services_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `services` clause type. To use it, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline would make the model see only individual sentences instead of the whole text, so it is better to skip it, unless you want to do binary classification at sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
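The paragraph splitting (by multiline) listed above can be sketched in plain Python. The helper below is illustrative only (it is not a Spark NLP API), and its whitespace token count is a rough stand-in for the model tokenizer:

```python
# Split a long document into paragraph chunks on blank lines, and flag any
# chunk whose naive whitespace token count exceeds the 512-token limit of
# this model's embeddings.
def split_paragraphs(text, max_tokens=512):
    chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [(chunk, len(chunk.split()) <= max_tokens) for chunk in chunks]

doc = "First clause paragraph.\n\nSecond clause paragraph."
print(split_paragraphs(doc))
# [('First clause paragraph.', True), ('Second clause paragraph.', True)]
```

Chunks flagged `False` should be split further (for example by headers) before being passed to the classifier.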
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `services`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_services_clause_en_1.0.0_3.2_1660123991970.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_services_clause_en_1.0.0_3.2_1660123991970.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+----------+
|    result|
+----------+
|[services]|
|   [other]|
|   [other]|
|[services]|
+----------+
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_services_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.97 0.95 0.96 78
services 0.89 0.94 0.91 34
accuracy - - 0.95 112
macro-avg 0.93 0.94 0.94 112
weighted-avg 0.95 0.95 0.95 112
```
---
layout: model
title: Icelandic NER Pipeline
author: John Snow Labs
name: roberta_token_classifier_icelandic_ner_pipeline
date: 2022-06-25
tags: [open_source, ner, token_classifier, roberta, icelandic, is]
task: Named Entity Recognition
language: is
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [roberta_token_classifier_icelandic_ner](https://nlp.johnsnowlabs.com/2021/12/06/roberta_token_classifier_icelandic_ner_is.html) model.
## Predicted Entities
`Person`, `Location`, `Date`, `Organization`, `Money`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_icelandic_ner_pipeline_is_4.0.0_3.0_1656122302435.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_icelandic_ner_pipeline_is_4.0.0_3.0_1656122302435.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("roberta_token_classifier_icelandic_ner_pipeline", lang = "is")
pipeline.annotate("Ég heiti Peter Fergusson. Ég hef búið í New York síðan í október 2011 og unnið hjá Tesla Motor og þénað 100K $ á ári.")
```
```scala
val pipeline = new PretrainedPipeline("roberta_token_classifier_icelandic_ner_pipeline", lang = "is")
pipeline.annotate("Ég heiti Peter Fergusson. Ég hef búið í New York síðan í október 2011 og unnið hjá Tesla Motor og þénað 100K $ á ári.")
```
## Results
```bash
+----------------+------------+
|chunk |ner_label |
+----------------+------------+
|Peter Fergusson |Person |
|New York |Location |
|október 2011 |Date |
|Tesla Motor |Organization|
|100K $ |Money |
+----------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_token_classifier_icelandic_ner_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|is|
|Size:|457.5 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- RoBertaForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from yohein)
author: John Snow Labs
name: distilbert_qa_yohein_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `yohein`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_yohein_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773298911.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_yohein_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773298911.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_yohein_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_yohein_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_yohein_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/yohein/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Pipeline to Detect biological concepts (biobert)
author: John Snow Labs
name: ner_bionlp_biobert_pipeline
date: 2023-03-20
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_bionlp_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_bionlp_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_biobert_pipeline_en_4.3.0_3.2_1679313010526.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_biobert_pipeline_en_4.3.0_3.2_1679313010526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_bionlp_biobert_pipeline", "en", "clinical/models")
text = '''Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_bionlp_biobert_pipeline", "en", "clinical/models")
val text = "Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay"
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.bionlp_biobert.pipeline").predict("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_radiology_pipeline", "en", "clinical/models")
text = '''Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_radiology_pipeline", "en", "clinical/models")
val text = "Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.radiology.pipeline").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:-----------------------------------------|--------:|------:|:--------------------------|-------------:|
| 0 | Bilateral breast | 0 | 15 | BodyPart | 0.945 |
| 1 | ultrasound | 17 | 26 | ImagingTest | 0.6734 |
| 2 | ovoid mass | 78 | 87 | ImagingFindings | 0.6095 |
| 3 | 0.5 x 0.5 x 0.4 | 113 | 127 | Measurements | 0.98158 |
| 4 | cm | 129 | 130 | Units | 0.9696 |
| 5 | anteromedial aspect of the left shoulder | 163 | 202 | BodyPart | 0.750517 |
| 6 | mass | 210 | 213 | ImagingFindings | 0.9711 |
| 7 | isoechoic echotexture | 228 | 248 | ImagingFindings | 0.80105 |
| 8 | muscle | 266 | 271 | BodyPart | 0.7963 |
| 9 | internal color flow | 294 | 312 | ImagingFindings | 0.477233 |
| 10 | benign fibrous tissue | 334 | 354 | ImagingFindings | 0.524067 |
| 11 | lipoma | 361 | 366 | Disease_Syndrome_Disorder | 0.6081 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_radiology_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Translate Italian to English Pipeline
author: John Snow Labs
name: translate_it_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, it, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `it`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_it_en_xx_2.7.0_2.4_1609689489920.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_it_en_xx_2.7.0_2.4_1609689489920.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_it_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_it_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.it.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_it_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Portuguese BertForTokenClassification Cased model (from pucpr)
author: John Snow Labs
name: bert_token_classifier_clinicalnerpt_diagnostic
date: 2022-11-30
tags: [pt, open_source, bert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: pt
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `clinicalnerpt-diagnostic` is a Portuguese model originally trained by `pucpr`.
## Predicted Entities
`DiagnosticProcedure`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_clinicalnerpt_diagnostic_pt_4.2.4_3.0_1669822359730.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_clinicalnerpt_diagnostic_pt_4.2.4_3.0_1669822359730.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_clinicalnerpt_diagnostic","pt") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_clinicalnerpt_diagnostic","pt")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_clinicalnerpt_diagnostic|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|pt|
|Size:|665.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/pucpr/clinicalnerpt-diagnostic
- https://www.aclweb.org/anthology/2020.clinicalnlp-1.7/
- https://github.com/HAILab-PUCPR/SemClinBr
- https://github.com/HAILab-PUCPR/BioBERTpt
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_recipe_triplet_recipes_base_easy_timestep_squadv2_epochs_3
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `recipe_triplet_recipes-roberta-base_EASY_TIMESTEP_squadv2_epochs_3` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_recipe_triplet_recipes_base_easy_timestep_squadv2_epochs_3_en_4.3.0_3.0_1674212108351.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_recipe_triplet_recipes_base_easy_timestep_squadv2_epochs_3_en_4.3.0_3.0_1674212108351.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_recipe_triplet_recipes_base_easy_timestep_squadv2_epochs_3","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_recipe_triplet_recipes_base_easy_timestep_squadv2_epochs_3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_recipe_triplet_recipes_base_easy_timestep_squadv2_epochs_3|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|467.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AnonymousSub/recipe_triplet_recipes-roberta-base_EASY_TIMESTEP_squadv2_epochs_3
---
layout: model
title: Lemmatizer (Dutch, SpacyLookup)
author: John Snow Labs
name: lemma_spacylookup
date: 2022-03-03
tags: [open_source, lemmatizer, nl]
task: Lemmatization
language: nl
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Dutch Lemmatizer is a scalable, production-ready version of the rule-based lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_nl_3.4.1_3.0_1646316580197.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_nl_3.4.1_3.0_1646316580197.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","nl") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer])
example = spark.createDataFrame([["Je bent niet beter dan ik"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","nl")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer))
val data = Seq("Je bent niet beter dan ik").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("nl.lemma").predict("""Je bent niet beter dan ik""")
```
## Results
```bash
+--------------------------------+
|result |
+--------------------------------+
|[Je, bent, niet, beter, dan, ik]|
+--------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma_spacylookup|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|nl|
|Size:|2.5 MB|
---
layout: model
title: Detect Problems, Tests and Treatments (ner_clinical_large)
author: John Snow Labs
name: ner_clinical
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for clinical terms. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
## Predicted Entities
`PROBLEM`, `TEST`, `TREATMENT`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_EVENTS_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_en_3.0.0_3.0_1617208419368.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_en_3.0.0_3.0_1617208419368.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([['The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.']], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))
val data = Seq("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.clinical").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_electricidad_small_finetuned_squadv1","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["¿Cuál es mi nombre?", "Mi nombre es Clara y vivo en Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_electricidad_small_finetuned_squadv1","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("¿Cuál es mi nombre?", "Mi nombre es Clara y vivo en Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.squad.electra.small").predict("""¿Cuál es mi nombre?|||Mi nombre es Clara y vivo en Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_electricidad_small_finetuned_squadv1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|51.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/mrm8488/electricidad-small-finetuned-squadv1-es
- https://github.com/ccasimiro88/TranslateAlignRetrieve/tree/master/SQuAD-es-v1.1
---
layout: model
title: Smaller BERT Sentence Embeddings (L-4_H-512_A-8)
author: John Snow Labs
name: sent_small_bert_L4_512
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L4_512_en_2.6.0_2.4_1598350568942.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L4_512_en_2.6.0_2.4_1598350568942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L4_512", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L4_512", "en")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.small_bert_L4_512').predict(text, output_level='sentence')
embeddings_df
```
{:.h2_title}
## Results
```bash
en_embed_sentence_small_bert_L4_512_embeddings sentence
[0.2255069762468338, 0.14144930243492126, 0.67... I hate cancer
[-0.5351444482803345, 0.36734339594841003, 0.1... Antibiotics aren't painkiller
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_small_bert_L4_512|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|[en]|
|Dimension:|512|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1
---
layout: model
title: Finnish asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP TFWav2Vec2ForCTC from Finnish-NLP
author: John Snow Labs
name: asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP
date: 2022-09-24
tags: [wav2vec2, fi, audio, open_source, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP` is a Finnish model originally trained by Finnish-NLP.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP_fi_4.2.0_3.0_1664025529159.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP_fi_4.2.0_3.0_1664025529159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP", "fi")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP", "fi")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|fi|
|Size:|3.6 GB|
---
layout: model
title: Pipeline to Detect Adverse Drug Events (healthcare)
author: John Snow Labs
name: ner_ade_healthcare_pipeline
date: 2022-03-22
tags: [licensed, ner, clinical, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_ade_healthcare](https://nlp.johnsnowlabs.com/2021/04/01/ner_ade_healthcare_en.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange.button-orange-trans.arr.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_ade_healthcare_pipeline_en_3.4.1_3.0_1647944180015.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_ade_healthcare_pipeline_en_3.4.1_3.0_1647944180015.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_ade_healthcare_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("Been taking Lipitor for 15 years, have experienced severe fatigue a lot!!!. Doctor moved me to voltaren 2 months ago, so far, have only experienced cramps")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_ade_healthcare_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("Been taking Lipitor for 15 years, have experienced severe fatigue a lot!!!. Doctor moved me to voltaren 2 months ago, so far, have only experienced cramps")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.healthcare_ade.pipeline").predict("""Been taking Lipitor for 15 years, have experienced severe fatigue a lot!!!. Doctor moved me to voltaren 2 months ago, so far, have only experienced cramps""")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|Lipitor |DRUG |
|severe fatigue|ADE |
|voltaren |DRUG |
|cramps |ADE |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_ade_healthcare_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|513.5 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: Legal Participations Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_participations_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, participations, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Participations` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do binary classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
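Outside Spark NLP, the first technique above (paragraph splitting by multiline) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not part of the Legal NLP API:

```python
import re

def split_paragraphs(text: str) -> list[str]:
    # Split on one or more blank lines (the "multiline" criterion),
    # then drop surrounding whitespace and empty chunks.
    chunks = re.split(r"\n\s*\n", text)
    return [c.strip() for c in chunks if c.strip()]

doc = "First clause paragraph.\n\nSecond clause paragraph.\n\n\nThird one."
print(split_paragraphs(doc))
# ['First clause paragraph.', 'Second clause paragraph.', 'Third one.']
```

Each resulting chunk can then be fed to the classifier as an independent text.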
Take into consideration that this model's embeddings allow up to 512 tokens. If your texts are longer, consider splitting them into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Participations`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_participations_bert_en_1.0.0_3.0_1678050674961.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_participations_bert_en_1.0.0_3.0_1678050674961.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
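The usage snippet is missing from this card. Below is a minimal sketch following the pattern of other `legclf_*` cards in Models Hub; the `sent_bert_base_cased` embeddings name and the `legal.ClassifierDLModel` annotator are assumptions here, so check them against the card's Input/Output Labels (`sentence_embeddings` → `class`) and the Legal NLP documentation before relying on it:

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence-level BERT embeddings feeding the classifier;
# the exact pretrained embeddings name is an assumption.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_participations_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```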
## Results
```bash
+----------------+
|result          |
+----------------+
|[Participations]|
|[Other]         |
|[Other]         |
|[Participations]|
+----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_participations_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 0.93 0.96 0.95 73
Participations 0.94 0.91 0.92 53
accuracy - - 0.94 126
macro-avg 0.94 0.93 0.93 126
weighted-avg 0.94 0.94 0.94 126
```
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_512_finetuned_squad_seed_2
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-512-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_512_finetuned_squad_seed_2_en_4.3.0_3.0_1674215491724.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_512_finetuned_squad_seed_2_en_4.3.0_3.0_1674215491724.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_512_finetuned_squad_seed_2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_512_finetuned_squad_seed_2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_512_finetuned_squad_seed_2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|432.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-512-finetuned-squad-seed-2
---
layout: model
title: Stopwords Remover for Ligurian language (162 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, lij, open_source]
task: Stop Words Removal
language: lij
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_lij_3.4.1_3.0_1646653713114.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_lij_3.4.1_3.0_1646653713114.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","lij") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Unde é a marina?"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","lij")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("Unde é a marina?").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("lij.stopwords").predict("""Unde é a marina?""")
```
## Results
```bash
+-----------------+
|result |
+-----------------+
|[Unde, marina, ?]|
+-----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|lij|
|Size:|1.8 KB|
---
layout: model
title: English BertForQuestionAnswering model (from haddadalwi)
author: John Snow Labs
name: bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_finetuned_islamic_squad
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-whole-word-masking-finetuned-squad-finetuned-islamic-squad` is an English model originally trained by `haddadalwi`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_finetuned_islamic_squad_en_4.0.0_3.0_1654537141717.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_finetuned_islamic_squad_en_4.0.0_3.0_1654537141717.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_finetuned_islamic_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_finetuned_islamic_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.large_uncased.by_haddadalwi").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_finetuned_islamic_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/haddadalwi/bert-large-uncased-whole-word-masking-finetuned-squad-finetuned-islamic-squad
---
layout: model
title: Sentence Entity Resolver for Billable ICD10-CM HCC Codes
author: John Snow Labs
name: sbiobertresolve_icd10cm_augmented_billable_hcc
date: 2021-11-01
tags: [icd10cm, hcc, entity_resolution, licensed, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.1
spark_version: 2.4
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to ICD10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings, and it supports 7-digit codes with HCC status. It has been updated by dropping the invalid codes that existed in previous versions. In the result, look for the `all_k_aux_labels` field in the metadata to get the HCC status, which can be split to obtain further information: billable status, HCC status, and HCC score.
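A small helper can unpack each `all_k_aux_labels` entry into its three parts. The `||` delimiter and the field order (billable, HCC status, HCC score) shown below are assumptions for illustration; verify them against the metadata your pipeline actually returns:

```python
def parse_hcc_status(aux_label: str) -> dict:
    # Assumed format: "<billable>||<hcc_status>||<hcc_score>", e.g. "1||1||88"
    billable, hcc_status, hcc_score = aux_label.split("||")
    return {
        "billable": billable == "1",
        "hcc_status": hcc_status == "1",
        "hcc_score": int(hcc_score),
    }

print(parse_hcc_status("1||1||88"))
# {'billable': True, 'hcc_status': True, 'hcc_score': 88}
```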
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_billable_hcc_en_3.3.1_2.4_1635784379929.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_billable_hcc_en_3.3.1_2.4_1635784379929.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The ```sbiobertresolve_icd10cm_augmented_billable_hcc``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_clinical``` as the NER model, with ```PROBLEM``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
c2doc = Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sentence_embeddings")
icd_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc", "en", "clinical/models") \
.setInputCols(["ner_chunk", "sentence_embeddings"]) \
.setOutputCol("icd10cm_code")\
.setDistanceFunction("EUCLIDEAN")
resolver_pipeline = Pipeline(
stages = [
document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter_icd,
c2doc,
sbert_embedder,
icd_resolver
])
data_ner = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."]]).toDF("text")
results = resolver_pipeline.fit(data_ner).transform(data_ner)
```
```scala
...
val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
val icd10_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models")
.setInputCols(Array("ner_chunk", "sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver))
val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.icd10cm.augmented_billable").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.""")
```
## Results
```bash
+-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ner_chunk| entity|icd10cm_code| resolutions| all_codes| billable_hcc|
+-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
| gestational diabetes mellitus|PROBLEM| O2441|gestational diabetes mellitus:::postpartum gestational diabetes mel...| O2441:::O2443:::Z8632:::Z875:::O2431:::O2411:::O244:::O241:::O2481|0||0||0:::0||0||0:::1||0||0:::0||0||0:::0||0||0:::0||0||0:::0||0||0...|
|subsequent type two diabetes mellitus|PROBLEM| O2411|pre-existing type 2 diabetes mellitus:::disorder associated with ty...|O2411:::E118:::E11:::E139:::E119:::E113:::E1144:::Z863:::Z8639:::E1...|0||0||0:::1||1||18:::0||0||0:::1||1||19:::1||1||19:::0||0||0:::1||1...|
| T2DM|PROBLEM| E11|t2dm [type 2 diabetes mellitus]:::tndm2:::t2 category:::sma2:::nf2:...|E11:::P702:::C801:::G121:::Q850:::C779:::C509:::C439:::E723:::C5700...|0||0||0:::1||0||0:::1||1||12:::1||1||72:::0||0||0:::1||1||10:::0||0...|
| HTG-induced pancreatitis|PROBLEM| K8520|alcohol-induced pancreatitis:::pancreatitis:::drug induced acute pa...|K8520:::K859:::K853:::K8590:::K85:::F102:::K858:::K8591:::K852:::K8...|1||0||0:::0||0||0:::0||0||0:::1||0||0:::0||0||0:::0||0||0:::0||0||0...|
| acute hepatitis|PROBLEM| K720|acute hepatitis:::acute hepatitis a:::acute infectious hepatitis:::...|K720:::B15:::B179:::B172:::Z0389:::B159:::B150:::B16:::K752:::K712:...|0||0||0:::0||0||0:::1||0||0:::1||0||0:::1||0||0:::1||0||0:::1||0||0...|
| obesity|PROBLEM| E669|obesity:::abdominal obesity:::obese:::central obesity:::overweight ...|E669:::E668:::Z6841:::Q130:::E66:::E6601:::Z8639:::E349:::H3550:::Z...|1||0||0:::1||0||0:::1||1||22:::1||0||0:::0||0||0:::1||1||22:::1||0|...|
| a body mass index|PROBLEM| Z6841|finding of body mass index:::observation of body mass index:::mass ...|Z6841:::E669:::R229:::Z681:::R223:::R221:::Z68:::R222:::R220:::R418...|1||1||22:::1||0||0:::1||0||0:::1||0||0:::0||0||0:::1||0||0:::0||0||...|
| polyuria|PROBLEM| R35|polyuria:::polyuric state:::polyuric state (disorder):::hematuria::...|R35:::R358:::E232:::R31:::R350:::R8299:::N401:::E723:::O048:::R300:...|0||0||0:::1||0||0:::1||1||23:::0||0||0:::1||0||0:::0||0||0:::1||0||...|
| polydipsia|PROBLEM| R631|polydipsia:::psychogenic polydipsia:::primary polydipsia:::psychoge...|R631:::F6389:::E232:::F639:::O40:::G475:::M7989:::R632:::R061:::H53...|1||0||0:::1||1||nan:::1||1||23:::1||1||nan:::0||0||0:::0||0||0:::1|...|
| poor appetite|PROBLEM| R630|poor appetite:::poor feeding:::bad taste in mouth:::unpleasant tast...|R630:::P929:::R438:::R432:::E86:::R196:::F520:::Z724:::R0689:::Z768...|1||0||0:::1||0||0:::1||0||0:::1||0||0:::0||0||0:::1||0||0:::1||0||0...|
| vomiting|PROBLEM| R111|vomiting:::intermittent vomiting:::vomiting symptoms:::periodic vom...| R111:::R11:::R1110:::G43A1:::P921:::P9209:::G43A:::R1113:::R110|0||0||0:::0||0||0:::1||0||0:::1||1||nan:::1||0||0:::1||0||0:::0||0|...|
| a respiratory tract infection|PROBLEM| J988|respiratory tract infection:::upper respiratory tract infection:::b...|J988:::J069:::A499:::J22:::J209:::Z593:::T17:::J0410:::Z1383:::J189...|1||0||0:::1||0||0:::1||0||0:::1||0||0:::1||0||0:::1||0||0:::0||0||0...|
+-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
```
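The resolver emits its candidates as flat metadata strings: `all_codes` is a `:::`-separated list, and `billable_hcc` packs three `||`-separated fields per candidate. A minimal pure-Python sketch for pairing each candidate code with its flags follows; the field order (billable status, HCC status, HCC score) is an assumption inferred from the column name, so verify it against your pipeline's output before relying on it.

```python
# Sketch: unpack the ":::"-separated candidate codes and the "||"-separated
# billable/HCC flags emitted by the resolver as flat metadata strings.
# Field order (billable, hcc_status, hcc_score) is assumed, not confirmed.

def unpack_candidates(all_codes: str, billable_hcc: str):
    codes = all_codes.split(":::")
    flags = billable_hcc.split(":::")
    candidates = []
    for code, flag in zip(codes, flags):
        billable, hcc_status, hcc_score = flag.split("||")
        candidates.append({
            "code": code.strip(),
            "billable": billable,
            "hcc_status": hcc_status,
            "hcc_score": hcc_score,
        })
    return candidates

# Example values taken from the first row of the results table above.
row = unpack_candidates(
    "O2441:::O2443:::Z8632",
    "0||0||0:::0||0||0:::1||0||0",
)
```

This keeps the candidate ranking intact (the first element is the top resolution) while making the flags addressable by name.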
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_icd10cm_augmented_billable_hcc|
|Compatibility:|Healthcare NLP 3.3.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[icd10cm_code]|
|Language:|en|
|Case sensitive:|false|
## Data Source
Trained on 01 November 2021 ICD10CM Dataset.
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_roberta_base_squad2_24465525
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-roberta-base-squad2-24465525` is an English model originally trained by `teacookies`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465525_en_4.0.0_3.0_1655987202658.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465525_en_4.0.0_3.0_1655987202658.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465525","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465525","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.xlm_roberta.base_24465525.by_teacookies").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_autonlp_roberta_base_squad2_24465525|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|887.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/teacookies/autonlp-roberta-base-squad2-24465525
---
layout: model
title: Bangla BertForMaskedLM Base Cased model (from sagorsarker)
author: John Snow Labs
name: bert_embeddings_bangla_base
date: 2022-12-02
tags: [bn, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: bn
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bangla-bert-base` is a Bangla model originally trained by `sagorsarker`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bangla_base_bn_4.2.4_3.0_1670015550585.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bangla_base_bn_4.2.4_3.0_1670015550585.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_bangla_base","bn") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_bangla_base","bn")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
use = UniversalSentenceEncoder.pretrained(lang="en") \
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
document_classifier = ClassifierDLModel.pretrained('classifierdl_use_trec50', 'en') \
.setInputCols(["document", "sentence_embeddings"]) \
.setOutputCol("class")
nlpPipeline = Pipeline(stages=[documentAssembler, use, document_classifier])
light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate('When did the construction of stone circles begin in the UK?')
```
```scala
val documentAssembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val use = UniversalSentenceEncoder.pretrained(lang="en")
.setInputCols(Array("document"))
.setOutputCol("sentence_embeddings")
val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_trec50", "en")
.setInputCols(Array("document", "sentence_embeddings"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier))
val data = Seq("When did the construction of stone circles begin in the UK?").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""When did the construction of stone circles begin in the UK?"""]
trec50_df = nlu.load('en.classify.trec50.use').predict(text, output_level = "document")
trec50_df[["document", "trec50"]]
```
## Results
```bash
+------------------------------------------------------------------------------------------------+------------+
|document |class |
+------------------------------------------------------------------------------------------------+------------+
|When did the construction of stone circles begin in the UK? | NUM_date |
+------------------------------------------------------------------------------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|classifierdl_use_trec50|
|Compatibility:|Spark NLP 2.7.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
## Data Source
This model is trained on the 50 class version of the TREC dataset. http://search.r-project.org/library/textdata/html/dataset_trec.html
---
layout: model
title: Legal Letters of credit Clause Binary Classifier
author: John Snow Labs
name: legclf_letters_of_credit_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `letters-of-credit` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
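The splitting strategies above can be sketched in plain Python. This is an illustrative stand-in for the workshop notebook, not Spark NLP code; the 512-token limit is approximated here by counting whitespace-delimited words, which undercounts the model's subword tokens, so keep a safety margin in real use.

```python
import re

def split_paragraphs(text: str):
    # Paragraph splitting "by multiline": break on one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def chunk_by_tokens(paragraph: str, max_tokens: int = 512):
    # Rough token budget: whitespace words only approximate the model's
    # subword tokens, so treat max_tokens as an upper bound with margin.
    words = paragraph.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

doc = "Clause 1. Letters of credit shall be issued...\n\nClause 2. Governing law..."
pieces = [c for p in split_paragraphs(doc) for c in chunk_by_tokens(p)]
```

Each resulting piece can then be fed to the classifier as a separate row of the input DataFrame.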
## Predicted Entities
`other`, `letters-of-credit`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_letters_of_credit_clause_en_1.0.0_3.2_1660122609389.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_letters_of_credit_clause_en_1.0.0_3.2_1660122609389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------------------+
|             result|
+-------------------+
|[letters-of-credit]|
|            [other]|
|            [other]|
|[letters-of-credit]|
+-------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_letters_of_credit_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
letters-of-credit 0.91 0.88 0.89 24
other 0.96 0.98 0.97 84
accuracy - - 0.95 108
macro-avg 0.94 0.93 0.93 108
weighted-avg 0.95 0.95 0.95 108
```
---
layout: model
title: Legal Delegation of duties Clause Binary Classifier
author: John Snow Labs
name: legclf_delegation_of_duties_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `delegation-of-duties` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `delegation-of-duties`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_delegation_of_duties_clause_en_1.0.0_3.2_1660122337035.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_delegation_of_duties_clause_en_1.0.0_3.2_1660122337035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+----------------------+
|                result|
+----------------------+
|[delegation-of-duties]|
|               [other]|
|               [other]|
|[delegation-of-duties]|
+----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_delegation_of_duties_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
delegation-of-duties 0.97 0.94 0.95 33
other 0.98 0.99 0.99 103
accuracy - - 0.98 136
macro-avg 0.97 0.96 0.97 136
weighted-avg 0.98 0.98 0.98 136
```
---
layout: model
title: Bangla RobertaForSequenceClassification Cased model (from neuralspace)
author: John Snow Labs
name: roberta_classifier_autotrain_citizen_nlu_bn_1370652766
date: 2022-12-09
tags: [bn, open_source, roberta, sequence_classification, classification, tensorflow]
task: Text Classification
language: bn
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-citizen_nlu_bn-1370652766` is a Bangla model originally trained by `neuralspace`.
## Predicted Entities
`ReportingMissingPets`, `EligibilityForBloodDonationCovidGap`, `ReportingPropertyTakeOver`, `IntentForBloodReceivalAppointment`, `EligibilityForBloodDonationSTD`, `InquiryForDoctorConsultation`, `InquiryOfCovidSymptoms`, `InquiryForVaccineCount`, `InquiryForCovidPrevention`, `InquiryForVaccinationRequirements`, `EligibilityForBloodDonationForPregnantWomen`, `ReportingCyberCrime`, `ReportingHitAndRun`, `ReportingTresspassing`, `InquiryofBloodDonationRequirements`, `ReportingMurder`, `ReportingVehicleAccident`, `ReportingMissingPerson`, `EligibilityForBloodDonationAgeLimit`, `ReportingAnimalPoaching`, `InquiryOfEmergencyContact`, `InquiryForQuarantinePeriod`, `ContactRealPerson`, `IntentForBloodDonationAppointment`, `ReportingMissingVehicle`, `InquiryForCovidRecentCasesCount`, `InquiryOfContact`, `StatusOfFIR`, `InquiryofVaccinationAgeLimit`, `InquiryForCovidTotalCasesCount`, `EligibilityForBloodDonationGap`, `InquiryofPostBloodDonationEffects`, `InquiryofPostBloodReceivalCareSchemes`, `EligibilityForBloodReceiversBloodGroup`, `EligitbilityForVaccine`, `InquiryOfLockdownDetails`, `ReportingSexualAssault`, `InquiryForVaccineCost`, `InquiryForCovidDeathCount`, `ReportingDrugConsumption`, `ReportingDrugTrafficing`, `InquiryofPostBloodDonationCertificate`, `ReportingDowry`, `ReportingChildAbuse`, `ReportingAnimalAbuse`, `InquiryofPostBloodReceivalEffects`, `Eligibility For BloodDonationWithComorbidities`, `InquiryOfTiming`, `InquiryForCovidActiveCasesCount`, `InquiryOfLocation`, `InquiryofPostBloodDonationCareSchemes`, `ReportingTheft`, `InquiryForTravelRestrictions`, `ReportingDomesticViolence`, `InquiryofBloodReceivalRequirements`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_autotrain_citizen_nlu_bn_1370652766_bn_4.2.4_3.0_1670623640434.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_autotrain_citizen_nlu_bn_1370652766_bn_4.2.4_3.0_1670623640434.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autotrain_citizen_nlu_bn_1370652766","bn") \
.setInputCols(["document", "token"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier])
data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autotrain_citizen_nlu_bn_1370652766","bn")
.setInputCols(Array("document", "token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier))
val data = Seq("I love you!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_classifier_autotrain_citizen_nlu_bn_1370652766|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|bn|
|Size:|312.2 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/neuralspace/autotrain-citizen_nlu_bn-1370652766
---
layout: model
title: Pipeline to Detect Radiology Entities, Assign Assertion Status and Find Relations
author: John Snow Labs
name: explain_clinical_doc_radiology
date: 2023-04-20
tags: [licensed, clinical, en, ner, assertion, relation_extraction, radiology]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A pipeline for detecting radiology entities with the `ner_radiology` NER model, assigning their assertion status with the `assertion_dl_radiology` model, and extracting relations between radiology-related terminology with the `re_test_problem_finding` relation extraction model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_radiology_en_4.3.0_3.2_1682019248720.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_radiology_en_4.3.0_3.2_1682019248720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("explain_clinical_doc_radiology", "en", "clinical/models")
text = """Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma."""
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("explain_clinical_doc_radiology", "en", "clinical/models")
val text = """Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma."""
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.explain_doc.clinical_radiology.pipeline").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""")
```
## Results
```bash
+----+------------------------------------------+---------------------------+
| | chunks | entities |
|---:|:-----------------------------------------|:--------------------------|
| 0 | Bilateral breast | BodyPart |
| 1 | ultrasound | ImagingTest |
| 2 | ovoid mass | ImagingFindings |
| 3 | 0.5 x 0.5 x 0.4 | Measurements |
| 4 | cm | Units |
| 5 | anteromedial aspect of the left shoulder | BodyPart |
| 6 | mass | ImagingFindings |
| 7 | isoechoic echotexture | ImagingFindings |
| 8 | muscle | BodyPart |
| 9 | internal color flow | ImagingFindings |
| 10 | benign fibrous tissue | ImagingFindings |
| 11 | lipoma | Disease_Syndrome_Disorder |
+----+------------------------------------------+---------------------------+
+----+-----------------------+---------------------------+-------------+
| | chunks | entities | assertion |
|---:|:----------------------|:--------------------------|:------------|
| 0 | ultrasound | ImagingTest | Confirmed |
| 1 | ovoid mass | ImagingFindings | Confirmed |
| 2 | mass | ImagingFindings | Confirmed |
| 3 | isoechoic echotexture | ImagingFindings | Confirmed |
| 4 | internal color flow | ImagingFindings | Negative |
| 5 | benign fibrous tissue | ImagingFindings | Suspected |
| 6 | lipoma | Disease_Syndrome_Disorder | Suspected |
+----+-----------------------+---------------------------+-------------+
+---------+-----------------+-----------------------+---------------------------+------------+
|relation | entity1 | chunk1 | entity2 | chunk2 |
|--------:|:----------------|:----------------------|:--------------------------|:-----------|
| 1 | ImagingTest | ultrasound | ImagingFindings | ovoid mass |
| 0 | ImagingFindings | benign fibrous tissue | Disease_Syndrome_Disorder | lipoma |
+---------+-----------------+-----------------------+---------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_clinical_doc_radiology|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
- NerConverterInternalModel
- AssertionDLModel
- PerceptronModel
- DependencyParserModel
- RelationExtractionModel
---
layout: model
title: English asr_Fine_Tuned_XLSR_English TFWav2Vec2ForCTC from Sania67
author: John Snow Labs
name: asr_Fine_Tuned_XLSR_English
date: 2022-09-26
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Fine_Tuned_XLSR_English` is an English model originally trained by Sania67.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_Fine_Tuned_XLSR_English_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Fine_Tuned_XLSR_English_en_4.2.0_3.0_1664199621537.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Fine_Tuned_XLSR_English_en_4.2.0_3.0_1664199621537.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_Fine_Tuned_XLSR_English", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_Fine_Tuned_XLSR_English", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_Fine_Tuned_XLSR_English|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
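The Python example above assumes an `audioDf` whose `audio_content` column holds arrays of floats. A minimal, Spark-free sketch of decoding a 16-bit PCM WAV into such a float list with only the standard library (the `audio_content` column name comes from the snippet above; the 32768.0 normalization constant is the usual int16 scale, an assumption about how the audio was encoded):

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes):
    """Decode 16-bit PCM WAV bytes into a list of floats in [-1.0, 1.0]."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        assert wav.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny in-memory WAV so the sketch is self-contained.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)       # mono
    wav.setsampwidth(2)       # 16-bit samples
    wav.setframerate(16000)   # wav2vec2 models expect 16 kHz audio
    wav.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

floats = wav_to_floats(buf.getvalue())
# The float list can then populate the "audio_content" column, e.g.:
# audioDf = spark.createDataFrame([[floats]], ["audio_content"])
```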
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_6_h_256
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-6_H-256` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_256_zh_4.2.4_3.0_1670021717667.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_256_zh_4.2.4_3.0_1670021717667.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_256","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_256","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_6_h_256|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|39.2 MB|
|Case sensitive:|true|
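The `embeddings` column produced above holds one 256-dimensional vector per token (H-256, per the model name). A common next step is comparing two such vectors with cosine similarity; a Spark-free sketch on plain float lists (the toy 3-dimensional vectors below are made up for illustration and merely stand in for token embeddings pulled from `result`):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for two token embeddings.
v1 = [0.2, 0.1, -0.4]
v2 = [0.2, 0.1, -0.4]
v3 = [-0.4, 0.3, 0.0]

print(cosine_similarity(v1, v2))  # identical vectors, so ~1.0
print(cosine_similarity(v1, v3))
</antml>```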
## References
- https://huggingface.co/uer/chinese_roberta_L-6_H-256
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://github.com/dbiir/UER-py/
- https://cloud.tencent.com/
---
layout: model
title: Sentence Entity Resolver for Snomed Aux Concepts, INT version (``sbiobert_base_cased_mli`` embeddings)
author: John Snow Labs
name: sbiobertresolve_snomed_auxConcepts_int
language: en
nav_key: models
repository: clinical/models
date: 2020-11-27
task: Entity Resolution
edition: Healthcare NLP 2.6.4
spark_version: 2.4
tags: [clinical,entity_resolution,en]
supported: true
annotator: SentenceEntityResolverModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model maps extracted medical entities to Snomed codes (with Morph Abnormality, Procedure, Substance, Physical Object, Body Structure concepts from INT version) using chunk embeddings.
{:.h2_title}
## Predicted Entities
Snomed Codes and their normalized definition with ``sbiobert_base_cased_mli`` embeddings.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_auxConcepts_int_en_2.6.4_2.4_1606235764318.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_auxConcepts_int_en_2.6.4_2.4_1606235764318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
snomed_aux_int_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_auxConcepts_int","en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_aux_int_resolver])
data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
...
val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
val snomed_aux_int_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_snomed_auxConcepts_int","en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_aux_int_resolver))
val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
```bash
+--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+
| chunk|begin|end| entity| code|confidence| resolutions| codes|
+--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+
| hypertension| 68| 79| PROBLEM| 148439002| 0.2138|risk factors pres...|148439002:::42595...|
|chronic renal ins...| 83|109| PROBLEM| 722403003| 0.8517|gastrointestinal ...|722403003:::13781...|
| COPD| 113|116| PROBLEM|845101000000100| 0.0962|management of chr...|845101000000100::...|
| gastritis| 120|128| PROBLEM| 711498001| 0.3398|magnetic resonanc...|711498001:::71771...|
| TIA| 136|138| PROBLEM| 449758002| 0.1927|traumatic infarct...|449758002:::85844...|
|a non-ST elevatio...| 182|202| PROBLEM| 1411000087101| 0.0823|ct of left knee::...|1411000087101:::3...|
|Guaiac positive s...| 208|229| PROBLEM| 388507006| 0.0555|asparagus rast:::...|388507006:::71771...|
|cardiac catheteri...| 295|317| TEST| 41976001| 0.9790|cardiac catheteri...|41976001:::705921...|
| PTCA| 324|327|TREATMENT| 312644004| 0.0616|angioplasty of po...|312644004:::41507...|
| mid LAD lesion| 332|345| PROBLEM| 91749005| 0.1399|structure of firs...|91749005:::917470...|
+--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+
```
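In the table above, the `resolutions` and `codes` columns pack the resolver's top-k candidates for each chunk into a single string delimited by `:::` (shown truncated). A minimal sketch of unpacking such a cell back into a ranked list (the sample cell below uses the first code from the table followed by placeholder codes, since the real tail is truncated):

```python
def split_candidates(cell, sep=":::"):
    """Split a ':::'-delimited resolver cell into a ranked candidate list."""
    return [c.strip() for c in cell.split(sep) if c.strip()]

# First code taken from the table's hypertension row; the rest are placeholders.
codes_cell = "148439002:::111111111:::222222222"
ranked = split_candidates(codes_cell)
print(ranked[0])  # "148439002", the top-ranked code
```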
{:.model-param}
## Model Information
{:.table-model}
|---------------|---------------------|
| Name: | sbiobertresolve_snomed_auxConcepts_int |
| Type: | SentenceEntityResolverModel |
| Compatibility: | Spark NLP 2.6.4 + |
| License: | Licensed |
| Edition: | Official |
|Input labels: | [ner_chunk, chunk_embeddings] |
|Output labels: | [resolution] |
| Language: | en |
| Dependencies: | sbiobert_base_cased_mli |
{:.h2_title}
## Data Source
Trained on SNOMED (INT version) Findings with ``sbiobert_base_cased_mli`` sentence embeddings.
http://www.snomed.org/
---
layout: model
title: Translate North Germanic languages to English Pipeline
author: John Snow Labs
name: translate_gmq_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, gmq, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `gmq`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_gmq_en_xx_2.7.0_2.4_1609687798521.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_gmq_en_xx_2.7.0_2.4_1609687798521.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_gmq_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_gmq_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.gmq.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_gmq_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Chinese Bert Embeddings (Large, Roberta, Whole Word Masking)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_wwm_ext_large
date: 2022-04-11
tags: [bert, embeddings, zh, open_source]
task: Embeddings
language: zh
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chinese-roberta-wwm-ext-large` is a Chinese model originally trained by `hfl`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_ext_large_zh_3.4.2_3.0_1649668978927.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_ext_large_zh_3.4.2_3.0_1649668978927.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_ext_large","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_ext_large","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.embed.chinese_roberta_wwm_ext_large").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_wwm_ext_large|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|1.2 GB|
|Case sensitive:|true|
## References
- https://huggingface.co/hfl/chinese-roberta-wwm-ext-large
- https://arxiv.org/abs/1906.08101
- https://github.com/google-research/bert
- https://github.com/ymcui/Chinese-BERT-wwm
- https://github.com/ymcui/MacBERT
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/ymcui/HFL-Anthology
- https://arxiv.org/abs/2004.13922
- https://arxiv.org/abs/1906.08101
---
layout: model
title: Italian DistilBERT Embeddings
author: John Snow Labs
name: distilbert_embeddings_BERTino
date: 2022-04-12
tags: [distilbert, embeddings, it, open_source]
task: Embeddings
language: it
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: DistilBertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `BERTino` is an Italian model originally trained by `indigo-ai`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_BERTino_it_3.4.2_3.0_1649783809783.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_BERTino_it_3.4.2_3.0_1649783809783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_BERTino","it") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_BERTino","it")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Adoro Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("it.embed.BERTino").predict("""Adoro Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_embeddings_BERTino|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|it|
|Size:|253.3 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/indigo-ai/BERTino
- https://indigo.ai/en/
- https://www.corpusitaliano.it/
- https://corpora.dipintra.it/public/run.cgi/corp_info?corpname=itwac_full
- https://universaldependencies.org/treebanks/it_partut/index.html
- https://universaldependencies.org/treebanks/it_isdt/index.html
- https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
---
layout: model
title: Fast Neural Machine Translation Model from Dutch to English
author: John Snow Labs
name: opus_mt_nl_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, nl, en, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `nl`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_nl_en_xx_2.7.0_2.4_1609164556626.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_nl_en_xx_2.7.0_2.4_1609164556626.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_nl_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_nl_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.nl.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_nl_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: bert_qa_rule_based_hier_triplet_epochs_1_shard_1_kldiv_squad2.0
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_hier_triplet_epochs_1_shard_1_kldiv_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_hier_triplet_epochs_1_shard_1_kldiv_squad2.0_en_4.0.0_3.0_1657191165597.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_hier_triplet_epochs_1_shard_1_kldiv_squad2.0_en_4.0.0_3.0_1657191165597.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_hier_triplet_epochs_1_shard_1_kldiv_squad2.0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_hier_triplet_epochs_1_shard_1_kldiv_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_rule_based_hier_triplet_epochs_1_shard_1_kldiv_squad2.0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/rule_based_hier_triplet_epochs_1_shard_1_kldiv_squad2.0
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from aszidon)
author: John Snow Labs
name: distilbert_qa_custom5
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbertcustom5` is an English model originally trained by `aszidon`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom5_en_4.3.0_3.0_1672774713572.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom5_en_4.3.0_3.0_1672774713572.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom5","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom5","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_custom5|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/aszidon/distilbertcustom5
---
layout: model
title: Multilingual BertForQuestionAnswering model (from vanichandna)
author: John Snow Labs
name: bert_qa_bert_base_multilingual_cased_finetuned_squadv1
date: 2022-06-02
tags: [open_source, question_answering, bert]
task: Question Answering
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-squadv1` is a Multilingual model originally trained by `vanichandna`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetuned_squadv1_xx_4.0.0_3.0_1654180211247.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetuned_squadv1_xx_4.0.0_3.0_1654180211247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_multilingual_cased_finetuned_squadv1","xx") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_multilingual_cased_finetuned_squadv1","xx")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("xx.answer_question.squad.bert.multilingual_base_cased.by_vanichandna").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_multilingual_cased_finetuned_squadv1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|xx|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/vanichandna/bert-base-multilingual-cased-finetuned-squadv1
---
layout: model
title: Legal Joint Filing Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_joint_filing_agreement_bert
date: 2022-11-24
tags: [en, legal, classification, agreement, joint_filing, licensed, bert]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_joint_filing_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `joint-filing-agreement` or not (Binary Classification).
Compared with the Longformer-based version of this classifier, this model is lighter and faster at inference.
## Predicted Entities
`joint-filing-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_joint_filing_agreement_bert_en_1.0.0_3.0_1669315294183.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_joint_filing_agreement_bert_en_1.0.0_3.0_1669315294183.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+------------------------+
|result                  |
+------------------------+
|[joint-filing-agreement]|
|[other]                 |
|[other]                 |
|[joint-filing-agreement]|
+------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_joint_filing_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents.
## Benchmarking
```bash
label precision recall f1-score support
joint-filing-agreement 1.00 0.97 0.98 31
other 0.98 1.00 0.99 65
accuracy - - 0.99 96
macro-avg 0.99 0.98 0.99 96
weighted-avg 0.99 0.99 0.99 96
```
---
layout: model
title: English BertForQuestionAnswering model (from peggyhuang)
author: John Snow Labs
name: bert_qa_finetune_bert_base_v1
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finetune-bert-base-v1` is an English model originally trained by `peggyhuang`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_finetune_bert_base_v1_en_4.0.0_3.0_1654187728270.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_finetune_bert_base_v1_en_4.0.0_3.0_1654187728270.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_finetune_bert_base_v1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_finetune_bert_base_v1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bert.base.by_peggyhuang").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_finetune_bert_base_v1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/peggyhuang/finetune-bert-base-v1
---
layout: model
title: Recognize Entities DL Pipeline for Spanish - Small
author: John Snow Labs
name: entity_recognizer_sm
date: 2021-03-22
tags: [open_source, spanish, entity_recognizer_sm, pipeline, es]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: es
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The entity_recognizer_sm is a pretrained pipeline that processes text with a simple sequence of basic steps.
It performs most of the common text processing tasks on your dataframe.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_es_3.0.0_3.0_1616441492784.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_es_3.0.0_3.0_1616441492784.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('entity_recognizer_sm', lang = 'es')
annotations = pipeline.fullAnnotate("Hola de John Snow Labs!")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("entity_recognizer_sm", lang = "es")
val result = pipeline.fullAnnotate("Hola de John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hola de John Snow Labs!"]
result_df = nlu.load('es.ner').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | embeddings | ner | entities |
|---:|:-----------------------------|:----------------------------|:----------------------------------------|:-----------------------------|:---------------------------------------|:-----------------------|
| 0 | ['Hola de John Snow Labs! '] | ['Hola de John Snow Labs!'] | ['Hola', 'de', 'John', 'Snow', 'Labs!'] | [[0.1754499971866607,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'B-MISC'] | ['John Snow', 'Labs!'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|entity_recognizer_sm|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|es|
---
layout: model
title: Legal Non Competition Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_non_competition_agreement_bert
date: 2023-01-26
tags: [en, legal, classification, non_competition, agreement, licensed, bert, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_non_competition_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `non-competition-agreement` or not (Binary Classification).
Compared to the Longformer model, this model is lighter, with faster inference times.
## Predicted Entities
`non-competition-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_non_competition_agreement_bert_en_1.0.0_3.0_1674734515053.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_non_competition_agreement_bert_en_1.0.0_3.0_1674734515053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
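No usage snippet ships with this card; the sketch below follows the document-classification pattern of the sibling `legclf_*_bert` cards, via the `nlp`/`legal` modules of the `johnsnowlabs` library. The `sent_bert_base_cased` embeddings name is an assumption — verify the embeddings this classifier expects.

```python
# Minimal sketch, assuming a running Spark session (`spark`), the johnsnowlabs
# library with a Legal NLP license, and sent_bert_base_cased embeddings.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Assumption: this classifier was trained on sent_bert_base_cased embeddings
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_non_competition_agreement_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```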
## Results
```bash
+---------------------------+
|                     result|
+---------------------------+
|[non-competition-agreement]|
|                    [other]|
|                    [other]|
|[non-competition-agreement]|
+---------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_non_competition_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|
## References
Legal documents, scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
non-competition-agreement 0.88 0.93 0.90 54
other 0.96 0.94 0.95 116
accuracy - - 0.94 170
macro-avg 0.92 0.93 0.93 170
weighted-avg 0.94 0.94 0.94 170
```
---
layout: model
title: BERT multilingual base model (uncased)
author: John Snow Labs
name: bert_base_multilingual_uncased
date: 2021-05-20
tags: [xx, multilingual, embeddings, bert, open_source]
task: Embeddings
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained model on the top 102 languages with the largest Wikipedia using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in [this repository](https://github.com/google-research/bert). This model is uncased: it does not make a difference between english and English.
BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it
was pretrained with two objectives:
- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
- Next sentence prediction (NSP): the models concatenate two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict if the two sentences were following each other or not. This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the BERT model as inputs.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_multilingual_uncased_xx_3.1.0_2.4_1621519949446.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_multilingual_uncased_xx_3.1.0_2.4_1621519949446.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
embeddings = BertEmbeddings.pretrained("bert_base_multilingual_uncased", "xx") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
```
```scala
val embeddings = BertEmbeddings.pretrained("bert_base_multilingual_uncased", "xx")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
```
{:.nlu-block}
```python
import nlu
nlu.load("xx.embed.bert_base_multilingual_uncased").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_base_multilingual_uncased|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[embeddings]|
|Language:|xx|
|Case sensitive:|true|
## Data Source
[https://huggingface.co/bert-base-multilingual-uncased](https://huggingface.co/bert-base-multilingual-uncased)
---
layout: model
title: Legal Execution in counterparts Clause Binary Classifier
author: John Snow Labs
name: legclf_execution_in_counterparts_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `execution-in-counterparts` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers in Models Hub, producing a series of True/False values for each of the legal clause models you add.
## Predicted Entities
`other`, `execution-in-counterparts`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_execution_in_counterparts_clause_en_1.0.0_3.2_1660122419235.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_execution_in_counterparts_clause_en_1.0.0_3.2_1660122419235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
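As with the other legal classifier cards, no usage snippet is shipped here; below is a hedged sketch following the standard Bert Sentence Embeddings classification pattern. The `sent_bert_base_cased` embeddings name is an assumption; note this model's output column is `category` per the Model Information table.

```python
# Minimal sketch, assuming a running Spark session (`spark`), the johnsnowlabs
# library with a Legal NLP license, and sent_bert_base_cased embeddings.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Assumption: this classifier was trained on sent_bert_base_cased embeddings
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_execution_in_counterparts_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```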
## Results
```bash
+---------------------------+
|                     result|
+---------------------------+
|[execution-in-counterparts]|
|                    [other]|
|                    [other]|
|[execution-in-counterparts]|
+---------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_execution_in_counterparts_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
execution-in-counterparts 1.00 0.95 0.98 42
other 0.98 1.00 0.99 128
accuracy - - 0.99 170
macro-avg 0.99 0.98 0.98 170
weighted-avg 0.99 0.99 0.99 170
```
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from shahma)
author: John Snow Labs
name: distilbert_qa_shahma_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `shahma`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_shahma_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772506673.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_shahma_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772506673.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shahma_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shahma_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_shahma_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/shahma/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Legal Applicable Law Clause Binary Classifier
author: John Snow Labs
name: legclf_applic_law_clause
date: 2023-02-13
tags: [en, legal, classification, applicable, law, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `applic_law` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers in Models Hub, producing a series of True/False values for each of the legal clause models you add.
## Predicted Entities
`applic_law`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_applic_law_clause_en_1.0.0_3.0_1676302179504.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_applic_law_clause_en_1.0.0_3.0_1676302179504.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
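This card also omits a usage snippet; below is a hedged sketch following the same Bert Sentence Embeddings classification pattern as the other legal classifier cards. The `sent_bert_base_cased` embeddings name is an assumption.

```python
# Minimal sketch, assuming a running Spark session (`spark`), the johnsnowlabs
# library with a Legal NLP license, and sent_bert_base_cased embeddings.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Assumption: this classifier was trained on sent_bert_base_cased embeddings
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_applic_law_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```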
## Results
```bash
+------------+
|      result|
+------------+
|[applic_law]|
|     [other]|
|     [other]|
|[applic_law]|
+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_applic_law_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
applic_law 1.00 0.95 0.97 20
other 0.93 1.00 0.96 13
accuracy - - 0.97 33
macro-avg 0.96 0.97 0.97 33
weighted-avg 0.97 0.97 0.97 33
```
---
layout: model
title: Legal Liens Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_liens_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, liens, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Liens` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers in Models Hub, producing a series of True/False values for each of the legal clause models you add.
## Predicted Entities
`Liens`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_liens_bert_en_1.0.0_3.0_1678050683070.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_liens_bert_en_1.0.0_3.0_1678050683070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
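No usage snippet ships with this card either; below is a hedged sketch following the Bert Sentence Embeddings classification pattern of the sibling `legclf_*_bert` cards. The `sent_bert_base_cased` embeddings name is an assumption.

```python
# Minimal sketch, assuming a running Spark session (`spark`), the johnsnowlabs
# library with a Legal NLP license, and sent_bert_base_cased embeddings.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Assumption: this classifier was trained on sent_bert_base_cased embeddings
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_liens_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```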
## Results
```bash
+-------+
| result|
+-------+
|[Liens]|
|[Other]|
|[Other]|
|[Liens]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_liens_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Liens 0.90 0.84 0.87 31
Other 0.91 0.94 0.93 53
accuracy - - 0.90 84
macro-avg 0.90 0.89 0.90 84
weighted-avg 0.90 0.90 0.90 84
```
---
layout: model
title: German asr_wav2vec2_large_xlsr_german_demo TFWav2Vec2ForCTC from marcel
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_german_demo
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_german_demo` is a German model originally trained by marcel.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_german_demo_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_german_demo_de_4.2.0_3.0_1664103879889.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_german_demo_de_4.2.0_3.0_1664103879889.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_german_demo', lang = 'de')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_german_demo", lang = "de")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_german_demo|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|de|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Fast Neural Machine Translation Model from English to Tigrinya
author: John Snow Labs
name: opus_mt_en_ti
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, ti, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `en`
- target languages: `ti`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ti_xx_2.7.0_2.4_1609170948430.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ti_xx_2.7.0_2.4_1609170948430.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_ti", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_ti", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ti').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ti|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal No Material Adverse Effect Clause Binary Classifier
author: John Snow Labs
name: legclf_no_material_adverse_effect_clause
date: 2023-01-29
tags: [en, legal, classification, material, adverse, effect, clauses, no_material_adverse_effect, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `no-material-adverse-effect` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers in Models Hub, producing a series of True/False values for each of the legal clause models you add.
## Predicted Entities
`no-material-adverse-effect`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_material_adverse_effect_clause_en_1.0.0_3.0_1674993807450.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_material_adverse_effect_clause_en_1.0.0_3.0_1674993807450.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
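This card likewise omits a usage snippet; below is a hedged sketch following the Bert Sentence Embeddings classification pattern of the other legal classifier cards. The `sent_bert_base_cased` embeddings name is an assumption.

```python
# Minimal sketch, assuming a running Spark session (`spark`), the johnsnowlabs
# library with a Legal NLP license, and sent_bert_base_cased embeddings.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Assumption: this classifier was trained on sent_bert_base_cased embeddings
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_no_material_adverse_effect_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```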
## Results
```bash
+----------------------------+
|                      result|
+----------------------------+
|[no-material-adverse-effect]|
|                     [other]|
|                     [other]|
|[no-material-adverse-effect]|
+----------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_no_material_adverse_effect_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
no-material-adverse-effect 1.00 1.00 1.00 60
other 1.00 1.00 1.00 105
accuracy - - 1.00 165
macro-avg 1.00 1.00 1.00 165
weighted-avg 1.00 1.00 1.00 165
```
---
layout: model
title: Detect Living Species (roberta_embeddings_BR_BERTo)
author: John Snow Labs
name: ner_living_species_roberta
date: 2022-06-22
tags: [pt, ner, clinical, licensed, roberta]
task: Named Entity Recognition
language: pt
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Extract living species from clinical texts in Portuguese, which is critical to scientific disciplines such as medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `roberta_embeddings_BR_BERTo` embeddings.
It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others.
**NOTE :**
1. The text files were translated from Spanish with a neural machine translation system.
2. The annotations were translated with the same neural machine translation system.
3. The translated annotations were transferred to the translated text files using an annotation transfer technology.
## Predicted Entities
`HUMAN`, `SPECIES`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_roberta_pt_3.5.3_3.0_1655923058986.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_roberta_pt_3.5.3_3.0_1655923058986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_BR_BERTo","pt")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_living_species_roberta", "pt", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter
])
data = spark.createDataFrame([["""Mulher de 23 anos, de Capinota, Cochabamba, Bolívia. Ela está no nosso país há quatro anos. Frequentou o departamento de emergência obstétrica onde foi encontrada grávida de 37 semanas, com um colo dilatado de 5 cm e membranas rompidas. O obstetra de emergência realizou um teste de estreptococos negativo e solicitou um hemograma, glucose, bioquímica básica, HBV, HCV e serologia da sífilis."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_BR_BERTo","pt")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_living_species_roberta", "pt", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter))
val data = Seq("""Mulher de 23 anos, de Capinota, Cochabamba, Bolívia. Ela está no nosso país há quatro anos. Frequentou o departamento de emergência obstétrica onde foi encontrada grávida de 37 semanas, com um colo dilatado de 5 cm e membranas rompidas. O obstetra de emergência realizou um teste de estreptococos negativo e solicitou um hemograma, glucose, bioquímica básica, HBV, HCV e serologia da sífilis.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("pt.med_ner.living_species.roberta").predict("""Mulher de 23 anos, de Capinota, Cochabamba, Bolívia. Ela está no nosso país há quatro anos. Frequentou o departamento de emergência obstétrica onde foi encontrada grávida de 37 semanas, com um colo dilatado de 5 cm e membranas rompidas. O obstetra de emergência realizou um teste de estreptococos negativo e solicitou um hemograma, glucose, bioquímica básica, HBV, HCV e serologia da sífilis.""")
```
## Results
```bash
+-------------+-------+
|ner_chunk |label |
+-------------+-------+
|Mulher |HUMAN |
|grávida |HUMAN |
|estreptococos|SPECIES|
|HBV |SPECIES|
|HCV |SPECIES|
|sífilis |SPECIES|
+-------------+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_living_species_roberta|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|pt|
|Size:|16.4 MB|
## References
[https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/)
## Benchmarking
```bash
label precision recall f1-score support
B-HUMAN 0.86 0.91 0.88 2827
B-SPECIES 0.52 0.86 0.65 2796
I-HUMAN 0.79 0.43 0.55 180
I-SPECIES 0.62 0.81 0.70 1099
micro-avg 0.65 0.86 0.74 6902
macro-avg 0.69 0.75 0.70 6902
weighted-avg 0.68 0.86 0.75 6902
```
---
layout: model
title: Fast Neural Machine Translation Model from Azerbaijani to English
author: John Snow Labs
name: opus_mt_az_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, az, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects (see below for an incomplete list).
- source languages: `az`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_az_en_xx_2.7.0_2.4_1609163603785.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_az_en_xx_2.7.0_2.4_1609163603785.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_az_en", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_az_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.az.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_az_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Japanese Bert Embeddings (Base, Character Tokenization)
author: John Snow Labs
name: bert_embeddings_bert_base_japanese_char
date: 2022-04-11
tags: [bert, embeddings, ja, open_source]
task: Embeddings
language: ja
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-japanese-char` is a Japanese model originally trained by `cl-tohoku`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_char_ja_3.4.2_3.0_1649674188981.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_char_ja_3.4.2_3.0_1649674188981.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_char","ja") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_char","ja")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("私はSpark NLPを愛しています").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ja.embed.bert_base_japanese_char").predict("""私はSpark NLPを愛しています""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_japanese_char|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ja|
|Size:|334.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/cl-tohoku/bert-base-japanese-char
- https://github.com/google-research/bert
- https://github.com/cl-tohoku/bert-japanese/tree/v1.0
- https://github.com/attardi/wikiextractor
- https://taku910.github.io/mecab/
- https://creativecommons.org/licenses/by-sa/3.0/
- https://www.tensorflow.org/tfrc/
---
layout: model
title: Legal International Affairs Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_international_affairs_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, international_affairs, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
The `legclf_international_affairs_bert` model is a BERT Sentence Embeddings Document Classifier. Given a document, it classifies whether the document belongs to the class `International_Affairs` or not (binary classification), according to EuroVoc labels.
## Predicted Entities
`International_Affairs`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_international_affairs_bert_en_1.0.0_3.0_1678111736686.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_international_affairs_bert_en_1.0.0_3.0_1678111736686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
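This card's usage section is empty; the sketch below follows the pipeline pattern used by the other Legal NLP BERT-based document classifiers. The sentence-embeddings model name (`sent_bert_base_cased`) and the `legal/models` folder are assumptions based on similar cards, not confirmed by this one.

```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

# Sentence embeddings feeding the classifier; model name is an assumption
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_international_affairs_bert", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```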
## Results
```bash
+-------+
|result|
+-------+
|[International_Affairs]|
|[Other]|
|[Other]|
|[International_Affairs]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_international_affairs_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.7 MB|
## References
Training dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
International_Affairs 0.85 0.95 0.90 43
Other 0.96 0.86 0.91 51
accuracy - - 0.90 94
macro-avg 0.91 0.91 0.90 94
weighted-avg 0.91 0.90 0.90 94
```
---
layout: model
title: Spanish RobertaForQuestionAnswering Cased model (from stevemobs)
author: John Snow Labs
name: roberta_qa_quales_iberlef_squad_2
date: 2023-01-20
tags: [es, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: es
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `quales-iberlef-squad_2` is a Spanish model originally trained by `stevemobs`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_quales_iberlef_squad_2_es_4.3.0_3.0_1674212017084.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_quales_iberlef_squad_2_es_4.3.0_3.0_1674212017084.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_quales_iberlef_squad_2","es")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_quales_iberlef_squad_2","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_quales_iberlef_squad_2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/stevemobs/quales-iberlef-squad_2
---
layout: model
title: Sentence Entity Resolver for CPT codes (Augmented)
author: John Snow Labs
name: sbiobertresolve_cpt_procedures_augmented
date: 2021-06-15
tags: [cpt, en, clinical, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.1.0
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to CPT codes using Sentence Bert Embeddings.
## Predicted Entities
CPT codes and their descriptions.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_procedures_augmented_en_3.1.0_3.0_1623789734339.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_procedures_augmented_en_3.1.0_3.0_1623789734339.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
jsl_sbert_embedder = BertSentenceEmbeddings\
.pretrained('sbiobert_base_cased_mli','en','clinical/models')\
.setInputCols(["ner_chunk"])\
.setOutputCol("sbert_embeddings")
cpt_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_cpt_procedures_augmented", "en", "clinical/models") \
.setInputCols(["ner_chunk", "sbert_embeddings"]) \
.setOutputCol("cpt_code")
cpt_pipelineModel = PipelineModel(
stages = [
documentAssembler,
jsl_sbert_embedder,
cpt_resolver])
cpt_lp = LightPipeline(cpt_pipelineModel)
result = cpt_lp.fullAnnotate("heart surgery")
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("sbert_embeddings")
val cpt_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_cpt_procedures_augmented", "en", "clinical/models")
.setInputCols(Array("ner_chunk", "sbert_embeddings"))
.setOutputCol("cpt_code")
val cpt_pipeline = new Pipeline().setStages(Array(document_assembler, sbert_embedder, cpt_resolver))
val cpt_lp = new LightPipeline(cpt_pipeline.fit(Seq("").toDF("text")))
val result = cpt_lp.fullAnnotate("heart surgery")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.cpt.procedures_augmented").predict("""heart surgery""")
```
## Results
```bash
| | chunks | code | resolutions | all_codes | all_distances |
|---:|:--------------|:----- |:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------|:--------------------------------------|
| 0 | heart surgery | 33258 | [Cardiac surgery procedure [Operative tissue ablation and reconstruction of atria, performed at the time of other cardiac procedure(s), extensive (eg, maze procedure), without cardiopulmonary bypass (List separately in addition to code for primary procedure)], Cardiac surgery procedure [Unlisted procedure, cardiac surgery], Heart procedure [Interrogation device evaluation (in person) of intracardiac ischemia monitoring system with analysis, review, and report], Heart procedure [Insertion or removal and replacement of intracardiac ischemia monitoring system including imaging supervision and interpretation when performed and intra-operative interrogation and programming when performed; device only], ...]| [33258, 33999, 0306T, 0304T, ...] | [0.1031, 0.1031, 0.1377, 0.1377, ...] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_cpt_procedures_augmented|
|Compatibility:|Healthcare NLP 3.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[cpt_code]|
|Language:|en|
|Case sensitive:|true|
## Data Source
Trained on the Current Procedural Terminology dataset with `sbiobert_base_cased_mli` sentence embeddings.
---
layout: model
title: Translate Brazilian Sign Language to English Pipeline
author: John Snow Labs
name: translate_bzs_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, bzs, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `bzs`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_bzs_en_xx_2.7.0_2.4_1609688839270.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_bzs_en_xx_2.7.0_2.4_1609688839270.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_bzs_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_bzs_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.bzs.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_bzs_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Fast Neural Machine Translation Model from Afrikaans to German
author: John Snow Labs
name: opus_mt_af_de
date: 2021-06-01
tags: [open_source, seq2seq, translation, af, de, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects (see below for an incomplete list).
- source languages: `af`
- target languages: `de`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_af_de_xx_3.1.0_2.4_1622561187540.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_af_de_xx_3.1.0_2.4_1622561187540.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_af_de", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_af_de", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Afrikaans.translate_to.German').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_af_de|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Smaller BERT Sentence Embeddings (L-8_H-512_A-8)
author: John Snow Labs
name: sent_small_bert_L8_512
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L8_512_en_2.6.0_2.4_1598350686215.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L8_512_en_2.6.0_2.4_1598350686215.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L8_512", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I hate cancer', "Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L8_512", "en")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.small_bert_L8_512').predict(text, output_level='sentence')
embeddings_df
```
{:.h2_title}
## Results
```bash
en_embed_sentence_small_bert_L8_512_embeddings sentence
[0.07683686912059784, -0.09125291556119919, 1.... I hate cancer
[0.05132533982396126, 0.16612868010997772, -0.... Antibiotics aren't painkiller
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_small_bert_L8_512|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|en|
|Dimension:|512|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from [https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1)
---
layout: model
title: Legal Control Agreement Document Classifier (Longformer)
author: John Snow Labs
name: legclf_control_agreement
date: 2022-11-10
tags: [legal, licensed, classification, control, agreement, en]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_control_agreement` model is a Legal Longformer Document Classifier that classifies whether a document belongs to the class `control-agreement` or not (binary classification).
Longformer models are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. For the large majority of documents in legal corpora, provided they are clean and contain only the legal document itself without extra leading material, 4096 tokens are enough to perform document classification.
If that is not the case for your documents, let us know and we can take another approach: splitting each document into 4096-token chunks and averaging their embeddings, then training on the averaged version, so that the whole document is taken into account. In practice this should rarely be necessary.
## Predicted Entities
`control-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_control_agreement_en_1.0.0_3.0_1668109264398.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_control_agreement_en_1.0.0_3.0_1668109264398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
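This card's usage section is empty; the sketch below follows the pattern other Longformer-based Legal NLP classifier cards use: Longformer token embeddings averaged into one sentence embedding per document, then fed to the classifier. The embeddings model name (`legal_longformer_base`) and the `legal/models` folder are assumptions, not confirmed by this card.

```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")

# Longformer token embeddings (model name is an assumption); input is truncated at 4096 tokens
embeddings = LongformerEmbeddings.pretrained("legal_longformer_base", "en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setMaxSentenceLength(4096)

# Average the token embeddings into a single vector per document
sentence_embeddings = SentenceEmbeddings() \
.setInputCols(["document", "embeddings"]) \
.setOutputCol("sentence_embeddings") \
.setPoolingStrategy("AVERAGE")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_control_agreement", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings, sentence_embeddings, docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```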
## Results
```bash
+-------+
| result|
+-------+
|[control-agreement]|
|[other]|
|[other]|
|[control-agreement]|
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_control_agreement|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.2 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
control-agreement 0.93 1.00 0.97 28
other 1.00 0.97 0.98 66
accuracy - - 0.98 94
macro-avg 0.97 0.98 0.98 94
weighted-avg 0.98 0.98 0.98 94
```
---
layout: model
title: Detect Assertion Status from Demographic Entities
author: John Snow Labs
name: assertion_oncology_demographic_binary_wip
date: 2022-10-01
tags: [licensed, clinical, oncology, en, assertion]
task: Assertion Status
language: en
nav_key: models
edition: Healthcare NLP 4.1.0
spark_version: 3.0
supported: true
annotator: AssertionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model detects if a demographic entity refers to the patient or to someone else.
## Predicted Entities
`Patient`, `Someone_Else`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_demographic_binary_wip_en_4.1.0_3.0_1664642285987.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_demographic_binary_wip_en_4.1.0_3.0_1664642285987.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(["Age", "Gender"])
assertion = AssertionDLModel.pretrained("assertion_oncology_demographic_binary_wip", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
assertion])
data = spark.createDataFrame([["One sister was diagnosed with breast cancer at the age of 40."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("Age", "Gender"))
val clinical_assertion = AssertionDLModel.pretrained("assertion_oncology_demographic_binary_wip","en","clinical/models")
.setInputCols(Array("sentence","ner_chunk","embeddings"))
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
assertion))
val data = Seq("One sister was diagnosed with breast cancer at the age of 40.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.assert.oncology_demographic_binary_wip").predict("""One sister was diagnosed with breast cancer at the age of 40.""")
```
## Results
```bash
| chunk | ner_label | assertion |
|:----------|:------------|:-------------|
| sister | Gender | Someone_Else |
| age of 40 | Age | Someone_Else |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|assertion_oncology_demographic_binary_wip|
|Compatibility:|Healthcare NLP 4.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, chunk, embeddings]|
|Output Labels:|[assertion_pred]|
|Language:|en|
|Size:|1.4 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label         precision  recall  f1-score  support
Patient            0.94    0.94      0.94     32.0
Someone_Else       0.92    0.92      0.92     24.0
macro-avg          0.93    0.93      0.93     56.0
weighted-avg       0.93    0.93      0.93     56.0
```
---
layout: model
title: Detect 10 Different Entities in Hebrew (hebrewner_cc_300d)
author: John Snow Labs
name: hebrewner_cc_300d
date: 2022-07-26
tags: [ner, open_source, he]
task: Named Entity Recognition
language: he
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: NerDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model uses Hebrew word embeddings to find 10 different types of entities in Hebrew text. It is trained using the `hebrew_cc_300d` word embeddings, so please use the same embeddings in the pipeline.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/hebrewner_cc_300d_he_4.0.0_3.0_1658872858968.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/hebrewner_cc_300d_he_4.0.0_3.0_1658872858968.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter at the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("hebrew_cc_300d", "he") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = NerDLModel.pretrained("hebrewner_cc_300d", "he") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate("""ב- 25 לאוגוסט עצר השב"כ את מוחמד אבו-ג'וייד , אזרח ירדני , שגויס לארגון הפת"ח והופעל על ידי חיזבאללה. אבו-ג'וייד התכוון להקים חוליות טרור בגדה ובקרב ערביי ישראל , לבצע פיגוע ברכבת ישראל בנהריה , לפגוע במטרות ישראליות בירדן ולחטוף חיילים כדי לשחרר אסירים ביטחוניים.""")
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("hebrew_cc_300d", "he")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("hebrewner_cc_300d", "he")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("""ב- 25 לאוגוסט עצר השב"כ את מוחמד אבו-ג"וייד , אזרח ירדני , שגויס לארגון הפת"ח והופעל על ידי חיזבאללה. אבו-ג"וייד התכוון להקים חוליות טרור בגדה ובקרב ערביי ישראל , לבצע פיגוע ברכבת ישראל בנהריה , לפגוע במטרות ישראליות בירדן ולחטוף חיילים כדי לשחרר אסירים ביטחוניים.""").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("he.ner").predict("""ח והופעל על ידי חיזבאללה. אבו-ג'וייד התכוון להקים חוליות טרור בגדה ובקרב ערביי ישראל , לבצע פיגוע ברכבת ישראל בנהריה , לפגוע במטרות ישראליות בירדן ולחטוף חיילים כדי לשחרר אסירים ביטחוניים.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_biomarker_pipeline", "en", "clinical/models")
pipeline.annotate("Here , we report the first case of an intraductal tubulopapillary neoplasm of the pancreas with clear cell morphology . Immunohistochemistry revealed positivity for Pan-CK , CK7 , CK8/18 , MUC1 , MUC6 , carbonic anhydrase IX , CD10 , EMA , β-catenin and e-cadherin ")
```
```scala
val pipeline = new PretrainedPipeline("ner_biomarker_pipeline", "en", "clinical/models")
pipeline.annotate("Here , we report the first case of an intraductal tubulopapillary neoplasm of the pancreas with clear cell morphology . Immunohistochemistry revealed positivity for Pan-CK , CK7 , CK8/18 , MUC1 , MUC6 , carbonic anhydrase IX , CD10 , EMA , β-catenin and e-cadherin ")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.biomarker.pipeline").predict("""Here , we report the first case of an intraductal tubulopapillary neoplasm of the pancreas with clear cell morphology . Immunohistochemistry revealed positivity for Pan-CK , CK7 , CK8/18 , MUC1 , MUC6 , carbonic anhydrase IX , CD10 , EMA , β-catenin and e-cadherin """)
```
## Results
```bash
| | ner_chunk | entity | confidence |
|---:|:-------------------------|:----------------------|-------------:|
| 0 | intraductal | CancerModifier | 0.9934 |
| 1 | tubulopapillary | CancerModifier | 0.6403 |
| 2 | neoplasm of the pancreas | CancerDx | 0.758825 |
| 3 | clear cell | CancerModifier | 0.9633 |
| 4 | Immunohistochemistry | Test | 0.9534 |
| 5 | positivity | Biomarker_Measurement | 0.8795 |
| 6 | Pan-CK | Biomarker | 0.9975 |
| 7 | CK7 | Biomarker | 0.9975 |
| 8 | CK8/18 | Biomarker | 0.9987 |
| 9 | MUC1 | Biomarker | 0.9967 |
| 10 | MUC6 | Biomarker | 0.9972 |
| 11 | carbonic anhydrase IX | Biomarker | 0.937567 |
| 12 | CD10 | Biomarker | 0.9974 |
| 13 | EMA | Biomarker | 0.9899 |
| 14 | β-catenin | Biomarker | 0.8059 |
| 15 | e-cadherin | Biomarker | 0.9806 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_biomarker_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: English image_classifier_vit_base_beans ViTForImageClassification from karthiksv
author: John Snow Labs
name: image_classifier_vit_base_beans
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_beans` is an English model originally trained by karthiksv.
## Predicted Entities
`angular_leaf_spot`, `bean_rust`, `healthy`
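Under the hood, an image classifier of this kind produces one raw score (logit) per class, and the predicted label is the argmax of the softmax over those scores. A minimal sketch with hypothetical logits for the three bean classes (the values are made up for illustration):

```python
import math

labels = ["angular_leaf_spot", "bean_rust", "healthy"]

def classify(logits):
    """Softmax the logits and return (best_label, probability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]

label, prob = classify([0.2, 3.1, -1.0])  # hypothetical logits from the model head
```

The real annotator performs this step internally and emits the winning label in the `class` output column.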
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_beans_en_4.1.0_3.0_1660169273658.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_beans_en_4.1.0_3.0_1660169273658.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_base_beans", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_base_beans", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_base_beans|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Finnish asr_wav2vec2_large_xlsr_53_finnish_by_Tommi TFWav2Vec2ForCTC from Tommi
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_Tommi
date: 2022-09-24
tags: [wav2vec2, fi, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_finnish_by_Tommi` is a Finnish model originally trained by Tommi.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_Tommi_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_Tommi_fi_4.2.0_3.0_1664021001412.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_Tommi_fi_4.2.0_3.0_1664021001412.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_Tommi', lang = 'fi')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_Tommi", lang = "fi")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_Tommi|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|fi|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Word2Vec Embeddings in Urdu (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, ur, open_source]
task: Embeddings
language: ur
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
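Conceptually, the lookup is a token-to-vector map, and similarity between tokens is usually measured with cosine similarity over the looked-up vectors. A toy sketch (the 3-d vectors below are made up; the real `w2v_cc_300d` model stores 300-dimensional vectors for Urdu tokens):

```python
import math

# Toy lookup table standing in for the real token-to-vector map.
lookup = {
    "a": [1.0, 0.0, 1.0],
    "b": [1.0, 0.1, 0.9],
    "c": [0.0, 1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

sim_ab = cosine(lookup["a"], lookup["b"])  # near-parallel vectors: high similarity
sim_ac = cosine(lookup["a"], lookup["c"])  # orthogonal vectors: zero similarity
```

Tokens missing from the table get a zero vector in the annotator's default configuration.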
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ur_3.4.1_3.0_1647465404657.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ur_3.4.1_3.0_1647465404657.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ur") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["مجھے سپارک این ایل پی سے محبت ہے"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ur")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("مجھے سپارک این ایل پی سے محبت ہے").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ur.embed.w2v_cc_300d").predict("""مجھے سپارک این ایل پی سے محبت ہے""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|ur|
|Size:|672.4 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Japanese Bert Embeddings (Base)
author: John Snow Labs
name: bert_embeddings_bert_base_japanese_char_extended
date: 2022-04-11
tags: [bert, embeddings, ja, open_source]
task: Embeddings
language: ja
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-japanese-char-extended` is a Japanese model originally trained by `KoichiYasuoka`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_char_extended_ja_3.4.2_3.0_1649674670666.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_char_extended_ja_3.4.2_3.0_1649674670666.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_char_extended","ja") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_char_extended","ja")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("私はSpark NLPを愛しています").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ja.embed.bert_base_japanese_char_extended").predict("""私はSpark NLPを愛しています""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_japanese_char_extended|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ja|
|Size:|341.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/KoichiYasuoka/bert-base-japanese-char-extended
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from mrm8488)
author: John Snow Labs
name: t5_small_finetuned_imdb_sentiment
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-finetuned-imdb-sentiment` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_imdb_sentiment_en_4.3.0_3.0_1675126033560.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_imdb_sentiment_en_4.3.0_3.0_1675126033560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_small_finetuned_imdb_sentiment","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_small_finetuned_imdb_sentiment","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_small_finetuned_imdb_sentiment|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|267.7 MB|
## References
- https://huggingface.co/mrm8488/t5-small-finetuned-imdb-sentiment
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/pdf/1910.10683.pdf
- https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67
- https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb
- https://github.com/patil-suraj
- https://twitter.com/mrm8488
- https://www.linkedin.com/in/manuel-romero-cs/
---
layout: model
title: English asr_asr_with_transformers_wav2vec2 TFWav2Vec2ForCTC from osanseviero
author: John Snow Labs
name: asr_asr_with_transformers_wav2vec2
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_asr_with_transformers_wav2vec2` is an English model originally trained by osanseviero.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_asr_with_transformers_wav2vec2_gpu
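Wav2Vec2ForCTC emits one label per audio frame, and a greedy CTC decode then collapses consecutive repeats and drops the blank token to produce the transcript. A minimal sketch of that decoding step (the per-frame labels below are made up for illustration):

```python
def ctc_greedy_decode(frame_labels, blank="_"):
    """Collapse consecutive repeated labels, then drop CTC blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

# Hypothetical per-frame argmax labels; the blank separates the repeated "l".
frames = ["h", "h", "e", "_", "l", "l", "_", "l", "o", "o"]
text = ctc_greedy_decode(frames)
```

The annotator performs this decoding internally and writes the result to the `text` output column.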
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_asr_with_transformers_wav2vec2_en_4.2.0_3.0_1664043824301.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_asr_with_transformers_wav2vec2_en_4.2.0_3.0_1664043824301.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_asr_with_transformers_wav2vec2", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_asr_with_transformers_wav2vec2", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_asr_with_transformers_wav2vec2|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|227.6 MB|
---
layout: model
title: Lithuanian asr_common_voice_lithuanian_fairseq TFWav2Vec2ForCTC from birgermoell
author: John Snow Labs
name: asr_common_voice_lithuanian_fairseq
date: 2022-09-26
tags: [wav2vec2, lt, audio, open_source, asr]
task: Automatic Speech Recognition
language: lt
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_common_voice_lithuanian_fairseq` is a Lithuanian model originally trained by birgermoell.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_common_voice_lithuanian_fairseq_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_common_voice_lithuanian_fairseq_lt_4.2.0_3.0_1664202849902.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_common_voice_lithuanian_fairseq_lt_4.2.0_3.0_1664202849902.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_common_voice_lithuanian_fairseq", "lt")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_common_voice_lithuanian_fairseq", "lt")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_common_voice_lithuanian_fairseq|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|lt|
|Size:|228.0 MB|
---
layout: model
title: English RobertaForQuestionAnswering (from ydshieh)
author: John Snow Labs
name: roberta_qa_ydshieh_roberta_base_squad2
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2` is an English model originally trained by `ydshieh`.
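Extractive QA models of this kind score every context token as a potential answer start and answer end; the answer is the span maximizing start score plus end score, with start ≤ end. A small sketch of that span-selection step (the token scores below are made up for illustration):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) maximizing start_scores[s] + end_scores[e], s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = ss + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]  # hypothetical scores
end   = [0.0, 0.1, 0.0, 4.8, 0.0, 0.0, 0.1, 0.0, 0.5, 0.0]  # hypothetical scores
s, e = best_span(start, end)
answer = " ".join(tokens[s:e + 1])
```

For the question "What's my name?" used in the example below, a well-trained model concentrates both scores on "Clara", which is what the annotator returns in the `answer` column.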
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ydshieh_roberta_base_squad2_en_4.0.0_3.0_1655735038173.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ydshieh_roberta_base_squad2_en_4.0.0_3.0_1655735038173.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_ydshieh_roberta_base_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_ydshieh_roberta_base_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.base.by_ydshieh").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_ydshieh_roberta_base_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ydshieh/roberta-base-squad2
- https://www.linkedin.com/company/deepset-ai/
- https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
- https://haystack.deepset.ai/community/join
- https://github.com/deepset-ai/FARM/issues/552
- https://github.com/deepset-ai/FARM
- http://www.deepset.ai/jobs
- https://twitter.com/deepset_ai
- https://github.com/deepset-ai/FARM/blob/master/examples/question_answering.py
- https://github.com/deepset-ai/haystack/discussions
- https://github.com/deepset-ai/haystack/
- https://deepset.ai
- https://deepset.ai/germanquad
- https://deepset.ai/german-bert
---
layout: model
title: Catalan, Valencian asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala TFWav2Vec2ForCTC from softcatala
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala
date: 2022-09-24
tags: [wav2vec2, ca, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: ca
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala` is a Catalan (Valencian) model originally trained by softcatala.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala_ca_4.2.0_3.0_1664037119962.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala_ca_4.2.0_3.0_1664037119962.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala', lang = 'ca')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala", lang = "ca")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|ca|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: SDOH Housing Insecurity For Classification
author: John Snow Labs
name: genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli
date: 2023-05-30
tags: [en, licensed, biobert, sdoh, housing, generic_classifier, housing_insecurity]
task: Text Classification
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: GenericClassifierModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Generic Classifier model detects whether the patient has housing insecurity. If the clinical note mentions housing problems, the model identifies them; if there is no housing issue, or the topic is not mentioned in the text, the note is labeled `No_Housing_Insecurity_Or_Not_Mentioned`. The model was trained using the GenericClassifierApproach annotator.
`Housing_Insecurity`: The patient has housing problems.
`No_Housing_Insecurity_Or_Not_Mentioned`: The patient has no housing problems or it is not mentioned in the clinical notes.
## Predicted Entities
`Housing_Insecurity`, `No_Housing_Insecurity_Or_Not_Mentioned`
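Downstream, the two labels reduce naturally to a boolean flag per document. A minimal post-processing sketch (the `rows` list and the `has_housing_insecurity` helper are hypothetical stand-ins for values collected from the pipeline's `prediction.result` column, not part of the model's API):

```python
# Hypothetical post-processing sketch: collapsing the classifier's two-label
# output into a boolean per document. The label strings are the ones listed
# under "Predicted Entities" above; `rows` stands in for collected results.
HOUSING_INSECURITY = "Housing_Insecurity"

def has_housing_insecurity(labels):
    """True if any prediction for a document is Housing_Insecurity."""
    return HOUSING_INSECURITY in labels

rows = [
    ["Housing_Insecurity"],
    ["No_Housing_Insecurity_Or_Not_Mentioned"],
]
print([has_housing_insecurity(r) for r in rows])  # → [True, False]
```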
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_CLASSIFICATION_GENERIC/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_CLASSIFICATION.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli_en_4.4.2_3.0_1685474945970.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli_en_4.4.2_3.0_1685474945970.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
features_asm = FeaturesAssembler()\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("features")
generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli", 'en', 'clinical/models')\
.setInputCols(["features"])\
.setOutputCol("prediction")
pipeline = Pipeline(stages=[
document_assembler,
sentence_embeddings,
features_asm,
generic_classifier
])
text_list = ["The patient is homeless.",
"Patient is a 50-year-old male who no has stable housing. He recently underwent a hip replacement surgery and has made a full recovery. ",
"Patient is a 25-year-old female who has her private housing. She presented with symptoms of a urinary tract infection and was diagnosed with the condition. Her living situation has allowed her to receive prompt medical care and treatment, and she has made a full recovery. ",
"""Patient: Mary H. Background: Mary is a 40-year-old woman who has been diagnosed with asthma and allergies. She has been managing her conditions with medication and regular follow-up appointments with her healthcare provider. She lives in a rented apartment with her husband and two children and has been stably housed for the past five years.
Presenting problem: Mary presents to the clinic for a routine check-up and reports no significant changes in her health status or symptoms related to her asthma or allergies. However, she expresses concerns about the quality of the air in her apartment and potential environmental triggers that could impact her health.
Medical history: Mary has a medical history of asthma and allergies. She takes an inhaler and antihistamines to manage her conditions.
Social history: Mary is married with two children and lives in a rented apartment. She and her husband both work full-time jobs and have health insurance. They have savings and are able to cover basic expenses.
Assessment: The clinician assesses Mary's medical conditions and determines that her asthma and allergies are stable and well-controlled. The clinician also assesses Mary's housing situation and determines that her apartment building is in good condition and does not present any immediate environmental hazards.
Plan: The clinician advises Mary to continue to monitor her health conditions and to report any changes or concerns to her healthcare team. The clinician also prescribes a referral to an allergist who can provide additional evaluation and treatment for her allergies. The clinician recommends that Mary and her family take steps to minimize potential environmental triggers in their apartment, such as avoiding smoking and using air purifiers. The clinician advises Mary to continue to maintain her stable housing situation and to seek assistance if any financial or housing issues arise.
""",
"""Patient: Sarah L. Background: Sarah is a 35-year-old woman who has been experiencing housing insecurity for the past year. She was evicted from her apartment due to an increase in rent, which she could not afford, and has been staying with friends and family members ever since. She works as a part-time sales associate at a retail store and has no medical insurance.
Presenting problem: Sarah presents to the clinic with complaints of increased stress and anxiety related to her housing insecurity. She reports feeling constantly on edge and worried about where she will sleep each night. She is also having difficulty concentrating at work and has been missing shifts due to her anxiety.
Medical history: Sarah has no significant medical history and takes no medications.
Social history: Sarah is currently single and has no children. She has a high school diploma but has not attended college. She has been working at her current job for three years and earns minimum wage. She has no savings and relies on her income to cover basic expenses.
Assessment: The clinician assesses Sarah's mental health and determines that she is experiencing symptoms of anxiety and depression related to her housing insecurity. The clinician also assesses Sarah's housing situation and determines that she is at risk for homelessness if she is unable to secure stable housing soon.
Plan: The clinician refers Sarah to a social worker who can help her connect with local housing resources, including subsidized housing programs and emergency shelters. The clinician also prescribes an antidepressant medication to help manage her symptoms of anxiety and depression. The clinician advises Sarah to continue to seek employment opportunities that may offer higher pay and stability."""]
df = spark.createDataFrame(text_list, StringType()).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("text", "prediction.result").show(truncate=100)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val features_asm = new FeaturesAssembler()
.setInputCols("sentence_embeddings")
.setOutputCol("features")
val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli", "en", "clinical/models")
.setInputCols("features")
.setOutputCol("prediction")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_embeddings,
features_asm,
generic_classifier))
val data = Seq("The patient is homeless.",
"Patient is a 50-year-old male who no has stable housing. He recently underwent a hip replacement surgery and has made a full recovery. ",
"Patient is a 25-year-old female who has her private housing. She presented with symptoms of a urinary tract infection and was diagnosed with the condition. Her living situation has allowed her to receive prompt medical care and treatment, and she has made a full recovery. ",
"""Patient: Mary H. Background: Mary is a 40-year-old woman who has been diagnosed with asthma and allergies. She has been managing her conditions with medication and regular follow-up appointments with her healthcare provider. She lives in a rented apartment with her husband and two children and has been stably housed for the past five years.
Presenting problem: Mary presents to the clinic for a routine check-up and reports no significant changes in her health status or symptoms related to her asthma or allergies. However, she expresses concerns about the quality of the air in her apartment and potential environmental triggers that could impact her health.
Medical history: Mary has a medical history of asthma and allergies. She takes an inhaler and antihistamines to manage her conditions.
Social history: Mary is married with two children and lives in a rented apartment. She and her husband both work full-time jobs and have health insurance. They have savings and are able to cover basic expenses.
Assessment: The clinician assesses Mary's medical conditions and determines that her asthma and allergies are stable and well-controlled. The clinician also assesses Mary's housing situation and determines that her apartment building is in good condition and does not present any immediate environmental hazards.
Plan: The clinician advises Mary to continue to monitor her health conditions and to report any changes or concerns to her healthcare team. The clinician also prescribes a referral to an allergist who can provide additional evaluation and treatment for her allergies. The clinician recommends that Mary and her family take steps to minimize potential environmental triggers in their apartment, such as avoiding smoking and using air purifiers. The clinician advises Mary to continue to maintain her stable housing situation and to seek assistance if any financial or housing issues arise.
""",
"""Patient: Sarah L. Background: Sarah is a 35-year-old woman who has been experiencing housing insecurity for the past year. She was evicted from her apartment due to an increase in rent, which she could not afford, and has been staying with friends and family members ever since. She works as a part-time sales associate at a retail store and has no medical insurance.
Presenting problem: Sarah presents to the clinic with complaints of increased stress and anxiety related to her housing insecurity. She reports feeling constantly on edge and worried about where she will sleep each night. She is also having difficulty concentrating at work and has been missing shifts due to her anxiety.
Medical history: Sarah has no significant medical history and takes no medications.
Social history: Sarah is currently single and has no children. She has a high school diploma but has not attended college. She has been working at her current job for three years and earns minimum wage. She has no savings and relies on her income to cover basic expenses.
Assessment: The clinician assesses Sarah's mental health and determines that she is experiencing symptoms of anxiety and depression related to her housing insecurity. The clinician also assesses Sarah's housing situation and determines that she is at risk for homelessness if she is unable to secure stable housing soon.
Plan: The clinician refers Sarah to a social worker who can help her connect with local housing resources, including subsidized housing programs and emergency shelters. The clinician also prescribes an antidepressant medication to help manage her symptoms of anxiety and depression. The clinician advises Sarah to continue to seek employment opportunities that may offer higher pay and stability.""").toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+----------------------------------------------------------------------------------------------------+----------------------------------------+
| text| result|
+----------------------------------------------------------------------------------------------------+----------------------------------------+
| The patient is homeless.| [Housing_Insecurity]|
|Patient is a 50-year-old male who no has stable housing. He recently underwent a hip replacement ...| [Housing_Insecurity]|
|Patient is a 25-year-old female who has her private housing. She presented with symptoms of a uri...|[No_Housing_Insecurity_Or_Not_Mentioned]|
|Patient: Mary H. Background: Mary is a 40-year-old woman who has been diagnosed with asthma and a...|[No_Housing_Insecurity_Or_Not_Mentioned]|
|Patient: Sarah L. Background: Sarah is a 35-year-old woman who has been experiencing housing ins...| [Housing_Insecurity]|
+----------------------------------------------------------------------------------------------------+----------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli|
|Compatibility:|Healthcare NLP 4.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[features]|
|Output Labels:|[prediction]|
|Language:|en|
|Size:|3.4 MB|
|Dependencies:|sbiobert_base_cased_mli|
## References
Internal SDOH Project
## Benchmarking
```bash
label precision recall f1-score support
Housing_Insecurity 0.81 0.92 0.86 37
No_Housing_Insecurity_Or_Not_Mentioned 0.95 0.87 0.90 60
accuracy - - 0.89 97
macro-avg 0.88 0.89 0.88 97
weighted-avg 0.89 0.89 0.89 97
```
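For reference, the summary rows follow the usual definitions: macro-avg is the unweighted mean of the per-class scores, while weighted-avg weights each class by its support. A quick check using the rounded per-class F1 figures from the table above (since the inputs are rounded, the last decimal place can differ slightly from the published averages):

```python
# Recomputing the summary rows of the benchmark table from its per-class
# F1 scores and supports (values copied from the table above).
support = {"Housing_Insecurity": 37, "No_Housing_Insecurity_Or_Not_Mentioned": 60}
f1 = {"Housing_Insecurity": 0.86, "No_Housing_Insecurity_Or_Not_Mentioned": 0.90}

macro_f1 = sum(f1.values()) / len(f1)                      # unweighted mean
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total  # support-weighted mean

print(round(macro_f1, 2))     # 0.88, matching the macro-avg row
print(round(weighted_f1, 2))  # 0.88 from rounded inputs; the table's 0.89 comes from unrounded scores
```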
---
layout: model
title: English BertForQuestionAnswering model (from vanichandna)
author: John Snow Labs
name: bert_qa_muril_finetuned_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `muril-finetuned-squad` is an English model originally trained by `vanichandna`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_muril_finetuned_squad_en_4.0.0_3.0_1654188629286.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_muril_finetuned_squad_en_4.0.0_3.0_1654188629286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_muril_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_muril_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.by_vanichandna").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_muril_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|891.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/vanichandna/muril-finetuned-squad
---
layout: model
title: English RobertaForQuestionAnswering (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739600551.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739600551.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.base_rule_based_only_classfn_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|460.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0
---
layout: model
title: Translate English to Ilocano Pipeline
author: John Snow Labs
name: translate_en_ilo
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, ilo, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `ilo`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ilo_xx_2.7.0_2.4_1609690588205.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ilo_xx_2.7.0_2.4_1609690588205.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_ilo", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_ilo", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.ilo').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_ilo|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Castilian, Spanish BertForQuestionAnswering model (from CenIA)
author: John Snow Labs
name: bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_sqac
date: 2022-06-02
tags: [open_source, question_answering, bert]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-uncased-finetuned-qa-sqac` is a Castilian, Spanish model originally trained by `CenIA`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_sqac_es_4.0.0_3.0_1654180585886.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_sqac_es_4.0.0_3.0_1654180585886.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_sqac","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_sqac","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.sqac.bert.base_uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_sqac|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|410.2 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/CenIA/bert-base-spanish-wwm-uncased-finetuned-qa-sqac
---
layout: model
title: Estonian Lemmatizer
author: John Snow Labs
name: lemma
date: 2020-11-28
task: Lemmatization
language: et
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [lemmatizer, et, open_source]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous.
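Conceptually, the lemmatizer behaves like a form-to-root lookup in which every inflection of a word maps to a single lemma. The sketch below is purely illustrative (it is not Spark NLP's implementation, and the tiny `LEMMAS` table covers only the Estonian forms of "ninth" used in the usage example further down):

```python
# Illustrative sketch of dictionary-based lemmatization: all inflected
# forms of a word collapse onto one root; unknown tokens pass through.
LEMMAS = {
    "üheksandana": "üheksas",  # "as the ninth"
    "üheksas": "üheksas",      # "ninth" (nominative)
    "üheksanda": "üheksas",    # genitive form
}

def lemmatize(tokens):
    # lowercase the lookup key so sentence-initial capitals ("Üheksas") still match
    return [LEMMAS.get(t.lower(), t) for t in tokens]

print(lemmatize("üheksandana üheksas üheksanda Üheksas".split()))
# → ['üheksas', 'üheksas', 'üheksas', 'üheksas']
```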
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_et_2.7.0_2.4_1606580379171.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_et_2.7.0_2.4_1606580379171.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of a pipeline after tokenisation.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
lemmatizer = LemmatizerModel.pretrained("lemma", "et") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate(['üheksandana üheksas üheksanda Üheksas'])
```
```scala
...
val lemmatizer = LemmatizerModel.pretrained("lemma", "et")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer))
val data = Seq("üheksandana üheksas üheksanda Üheksas").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["üheksandana üheksas üheksanda Üheksas"]
lemma_df = nlu.load('et.lemma').predict(text, output_level='document')
lemma_df.lemma.values[0]
```
## Results
```bash
{'lemma': [Annotation(token, 0, 10, üheksas, {'sentence': '0'}),
Annotation(token, 12, 18, üheksas, {'sentence': '0'}),
Annotation(token, 20, 28, üheksas, {'sentence': '0'}),
Annotation(token, 30, 36, üheksas, {'sentence': '0'})]}
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|et|
## Data Source
This model is trained on data obtained from [https://universaldependencies.org/](https://universaldependencies.org/)
---
layout: model
title: Pipeline to Detect problem, test, treatment in medical text (biobert)
author: John Snow Labs
name: ner_clinical_biobert_pipeline
date: 2023-03-20
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_clinical_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_clinical_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_biobert_pipeline_en_4.3.0_3.2_1679314695992.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_biobert_pipeline_en_4.3.0_3.2_1679314695992.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_clinical_biobert_pipeline", "en", "clinical/models")
text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_clinical_biobert_pipeline", "en", "clinical/models")
val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.clinical_biobert.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:----------------------------------------------------|--------:|------:|:------------|-------------:|
| 0 | congestion | 62 | 71 | PROBLEM | 0.5069 |
| 1 | some mild problems with his breathing while feeding | 163 | 213 | PROBLEM | 0.694063 |
| 2 | any perioral cyanosis | 233 | 253 | PROBLEM | 0.6493 |
| 3 | retractions | 258 | 268 | PROBLEM | 0.9971 |
| 4 | a tactile temperature | 302 | 322 | PROBLEM | 0.8294 |
| 5 | Tylenol | 345 | 351 | TREATMENT | 0.665 |
| 6 | some decreased p.o | 372 | 389 | PROBLEM | 0.771067 |
| 7 | His normal breast-feeding | 400 | 424 | TEST | 0.736767 |
| 8 | his respiratory congestion | 488 | 513 | PROBLEM | 0.745767 |
| 9 | more tired | 545 | 554 | PROBLEM | 0.6514 |
| 10 | fussy | 569 | 573 | PROBLEM | 0.6512 |
| 11 | albuterol treatments | 637 | 656 | TREATMENT | 0.8917 |
| 12 | His urine output | 675 | 690 | TEST | 0.7114 |
| 13 | any diarrhea | 832 | 843 | PROBLEM | 0.73595 |
```
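The chunk table above can be assembled from the `fullAnnotate` output. A minimal plain-Python sketch of that flattening step, using dicts to stand in for Spark NLP `Annotation` objects (the field names mirror the annotation shape, but the mock data here is illustrative, not the pipeline's actual return value):

```python
# Two mock ner_chunk annotations, shaped like Spark NLP Annotation objects
mock_chunks = [
    {"result": "congestion", "begin": 62, "end": 71,
     "metadata": {"entity": "PROBLEM", "confidence": "0.5069"}},
    {"result": "Tylenol", "begin": 345, "end": 351,
     "metadata": {"entity": "TREATMENT", "confidence": "0.665"}},
]

# Flatten each chunk into a (text, begin, end, label, confidence) row
rows = [
    (c["result"], c["begin"], c["end"],
     c["metadata"]["entity"], float(c["metadata"]["confidence"]))
    for c in mock_chunks
]
print(rows[0])  # ('congestion', 62, 71, 'PROBLEM', 0.5069)
```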
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_clinical_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.1 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Pipeline to Mapping SNOMED Codes with Their Corresponding UMLS Codes
author: John Snow Labs
name: snomed_umls_mapping
date: 2023-03-29
tags: [en, licensed, clinical, pipeline, chunk_mapping, snomed, umls]
task: Chunk Mapping
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the `snomed_umls_mapper` model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_umls_mapping_en_4.3.2_3.2_1680124512179.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_umls_mapping_en_4.3.2_3.2_1680124512179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("snomed_umls_mapping", "en", "clinical/models")
result = pipeline.fullAnnotate(["733187009", "449433008", "51264003"])
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("snomed_umls_mapping", "en", "clinical/models")
val result = pipeline.fullAnnotate(Array("733187009", "449433008", "51264003"))
```
{:.nlu-block}
```python
import nlu
nlu.load("en.snomed.umls.mapping").predict("""733187009 449433008 51264003""")
```
## Results
```bash
|    | snomed_code | umls_code |
|---:|:------------|:----------|
|  0 | 733187009   | C4546029  |
|  1 | 449433008   | C3164619  |
|  2 | 51264003    | C0271267  |
```
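The mapping in the results is positional: the i-th SNOMED code corresponds to the i-th UMLS code. A plain-Python sketch (not part of the pipeline API) of turning that output into a lookup table:

```python
# Codes taken from the results above; pair them positionally
snomed_codes = ["733187009", "449433008", "51264003"]
umls_codes = ["C4546029", "C3164619", "C0271267"]

# Build a SNOMED -> UMLS lookup dictionary
snomed_to_umls = dict(zip(snomed_codes, umls_codes))
print(snomed_to_umls["449433008"])  # C3164619
```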
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|snomed_umls_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|5.1 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- ChunkMapperModel
---
layout: model
title: English BertForQuestionAnswering Cased model (from Nadav)
author: John Snow Labs
name: bert_qa_macsquad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `MacSQuAD` is an English model originally trained by `Nadav`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_macsquad_en_4.0.0_3.0_1657182038098.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_macsquad_en_4.0.0_3.0_1657182038098.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_macsquad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_macsquad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_macsquad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Nadav/MacSQuAD
---
layout: model
title: Stop Words Cleaner for Indonesian
author: John Snow Labs
name: stopwords_id
date: 2020-07-14 19:03:00 +0800
task: Stop Words Removal
language: id
edition: Spark NLP 2.5.4
spark_version: 2.4
tags: [stopwords, id]
supported: true
annotator: StopWordsCleaner
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
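As an illustration of the idea (not the Spark NLP `StopWordsCleaner` itself), stop-word filtering can be sketched in plain Python; the small Indonesian word list below is a hypothetical sample, not the model's actual dictionary:

```python
# Hypothetical sample of Indonesian stop words
STOPWORDS = {"selain", "menjadi", "adalah", "seorang", "dan", "dalam"}

def clean_tokens(tokens):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "Selain menjadi raja utara , John Snow adalah seorang dokter".split()
print(clean_tokens(tokens))  # ['raja', 'utara', ',', 'John', 'Snow', 'dokter']
```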
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_id_id_2.5.4_2.4_1594742441630.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_id_id_2.5.4_2.4_1594742441630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
stop_words = StopWordsCleaner.pretrained("stopwords_id", "id") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Selain menjadi raja utara, John Snow adalah seorang dokter Inggris dan pemimpin dalam pengembangan anestesi dan kebersihan medis.")
```
```scala
...
val stopWords = StopWordsCleaner.pretrained("stopwords_id", "id")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords))
val data = Seq("Selain menjadi raja utara, John Snow adalah seorang dokter Inggris dan pemimpin dalam pengembangan anestesi dan kebersihan medis.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Selain menjadi raja utara, John Snow adalah seorang dokter Inggris dan pemimpin dalam pengembangan anestesi dan kebersihan medis."""]
stopword_df = nlu.load('id.stopwords').predict(text)
stopword_df[['cleanTokens']]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=7, end=13, result='menjadi', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=15, end=18, result='raja', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=20, end=24, result='utara', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=25, end=25, result=',', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=27, end=30, result='John', metadata={'sentence': '0'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_id|
|Type:|stopwords|
|Compatibility:|Spark NLP 2.5.4+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|id|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)
---
layout: model
title: Persian DistilBERT Embeddings (from HooshvareLab)
author: John Snow Labs
name: distilbert_embeddings_distilbert_fa_zwnj_base
date: 2022-04-12
tags: [distilbert, embeddings, fa, open_source]
task: Embeddings
language: fa
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-fa-zwnj-base` is a Persian model originally trained by `HooshvareLab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_fa_zwnj_base_fa_3.4.2_3.0_1649783880670.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_fa_zwnj_base_fa_3.4.2_3.0_1649783880670.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_fa_zwnj_base","fa") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["من عاشق جرقه NLP هستم"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_fa_zwnj_base","fa")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("من عاشق جرقه NLP هستم").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("fa.embed.distilbert_fa_zwnj_base").predict("""من عاشق جرقه NLP هستم""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_embeddings_distilbert_fa_zwnj_base|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|fa|
|Size:|282.6 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/HooshvareLab/distilbert-fa-zwnj-base
- https://github.com/hooshvare/parsbert/issues
---
layout: model
title: Extract Cancer Therapies and Posology Information
author: John Snow Labs
name: ner_oncology_unspecific_posology_healthcare
date: 2023-01-11
tags: [licensed, clinical, oncology, en, ner, treatment, posology]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts mentions of cancer therapies and posology information using unspecific (low-granularity) labels.
## Predicted Entities
`Posology_Information`, `Cancer_Therapy`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_healthcare_en_4.2.4_3.0_1673475870938.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_healthcare_en_4.2.4_3.0_1673475870938.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel\
.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel()\
.pretrained("embeddings_healthcare_100d", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel\
.pretrained("ner_oncology_unspecific_posology_healthcare", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving his second cycle of chemotherapy and is in good overall condition."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel
.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel()
.pretrained("embeddings_healthcare_100d", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_unspecific_posology_healthcare", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving his second cycle of chemotherapy and is in good overall condition.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_unspecific_posology_healthcare").predict("""The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving his second cycle of chemotherapy and is in good overall condition.""")
```
## Results
```bash
| chunk | ner_label |
|:-----------------|:---------------------|
| adriamycin | Cancer_Therapy |
| 60 mg/m2 | Posology_Information |
| cyclophosphamide | Cancer_Therapy |
| 600 mg/m2 | Posology_Information |
| over six courses | Posology_Information |
| second cycle | Posology_Information |
| chemotherapy | Cancer_Therapy |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_unspecific_posology_healthcare|
|Compatibility:|Healthcare NLP 4.2.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|33.8 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Posology_Information 1435 102 210 1645 0.93 0.87 0.90
Cancer_Therapy 1281 116 125 1406 0.92 0.91 0.91
macro-avg 2716 218 335 3051 0.93 0.89 0.91
micro-avg 2716 218 335 3051 0.93 0.89 0.91
```
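The precision, recall, and F1 figures in the benchmark follow directly from the tp/fp/fn counts; a quick sketch to re-derive them:

```python
# Standard definitions: precision = tp/(tp+fp), recall = tp/(tp+fn),
# F1 = harmonic mean of precision and recall
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

print(prf(1435, 102, 210))  # Posology_Information -> (0.93, 0.87, 0.9)
print(prf(1281, 116, 125))  # Cancer_Therapy -> (0.92, 0.91, 0.91)
```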
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket TFWav2Vec2ForCTC from lilitket
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket` is an English model originally trained by lilitket.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket_en_4.2.0_3.0_1664095206434.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket_en_4.2.0_3.0_1664095206434.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from LucasS)
author: John Snow Labs
name: roberta_qa_robertaabsa
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `robertaABSA` is an English model originally trained by `LucasS`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_robertaabsa_en_4.3.0_3.0_1674222776379.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_robertaabsa_en_4.3.0_3.0_1674222776379.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_robertaabsa","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_robertaabsa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_robertaabsa|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|437.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/LucasS/robertaABSA
---
layout: model
title: Financial English BERT Embeddings (Number masking)
author: John Snow Labs
name: bert_embeddings_sec_bert_num
date: 2022-04-12
tags: [bert, embeddings, en, open_source, financial]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Financial Pretrained BERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `sec-bert-num` is an English model originally trained by `nlpaueb`. This model is the same as BERT Base, but every number token is replaced with a [NUM] pseudo-token, so all numeric expressions are handled uniformly and are not fragmented.
If you are interested in Financial Embeddings, take a look also at these two models:
[sec-base](https://nlp.johnsnowlabs.com/2022/04/12/bert_embeddings_sec_bert_base_en_3_0.html): Same as Bert Base but trained with financial documents.
[sec-shape](https://nlp.johnsnowlabs.com/2022/04/12/bert_embeddings_sec_bert_sh_en_3_0.html): Same as Bert sec-base but we replace numbers with pseudo-tokens that represent the number’s shape, so numeric expressions (of known shapes) are no longer fragmented, e.g., '53.2' becomes '[XX.X]' and '40,200.5' becomes '[XX,XXX.X]'.
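The two numeric-masking strategies can be illustrated with a small regex sketch (an assumption about the described behavior, not nlpaueb's actual tokenizer code):

```python
import re

# A token counts as numeric if it is digits with optional ',' and '.'
NUMERIC = re.compile(r"[\d.,]*\d[\d.,]*")

def mask_num(token):
    """sec-bert-num style: replace any numeric token with [NUM]."""
    return "[NUM]" if NUMERIC.fullmatch(token) else token

def mask_shape(token):
    """sec-bert-shape style: replace each digit with X, keep punctuation."""
    if NUMERIC.fullmatch(token):
        return "[" + re.sub(r"\d", "X", token) + "]"
    return token

print(mask_num("53.2"))        # [NUM]
print(mask_shape("53.2"))      # [XX.X]
print(mask_shape("40,200.5"))  # [XX,XXX.X]
```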
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_sec_bert_num_en_3.4.2_3.0_1649759295271.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_sec_bert_num_en_3.4.2_3.0_1649759295271.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_num","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_num","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.sec_bert_num").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_sec_bert_num|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|409.5 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/nlpaueb/sec-bert-num
- https://arxiv.org/abs/2203.06482
- http://nlp.cs.aueb.gr/
---
layout: model
title: English AlbertForQuestionAnswering Large model (from elgeish)
author: John Snow Labs
name: albert_qa_cs224n_squad2.0_large_v2
date: 2022-06-24
tags: [en, open_source, albert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: AlBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cs224n-squad2.0-albert-large-v2` is an English model originally trained by `elgeish`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_cs224n_squad2.0_large_v2_en_4.0.0_3.0_1656064295571.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_cs224n_squad2.0_large_v2_en_4.0.0_3.0_1656064295571.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_cs224n_squad2.0_large_v2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_cs224n_squad2.0_large_v2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.albert.large_v2").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_qa_cs224n_squad2.0_large_v2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|63.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/elgeish/cs224n-squad2.0-albert-large-v2
- http://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf
- https://rajpurkar.github.io/SQuAD-explorer/
- https://github.com/elgeish/squad/tree/master/data
---
layout: model
title: Turkish DistilBERT Embeddings (from Geotrend)
author: John Snow Labs
name: distilbert_embeddings_distilbert_base_tr_cased
date: 2022-04-12
tags: [distilbert, embeddings, tr, open_source]
task: Embeddings
language: tr
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-tr-cased` is a Turkish model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_tr_cased_tr_3.4.2_3.0_1649783637258.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_tr_cased_tr_3.4.2_3.0_1649783637258.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_tr_cased","tr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Spark NLP'yi seviyorum"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_tr_cased","tr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Spark NLP'yi seviyorum").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("tr.embed.distilbert_base_cased").predict("""Spark NLP'yi seviyorum""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_embeddings_distilbert_base_tr_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|tr|
|Size:|216.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/distilbert-base-tr-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Cyberbullying Classifier
author: John Snow Labs
name: classifierdl_use_cyberbullying
date: 2021-01-09
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 2.7.1
spark_version: 2.4
tags: [open_source, en, classifier]
supported: true
annotator: ClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Classify tweets as racist, sexist, or neutral.
## Predicted Entities
`neutral`, `racism`, `sexism`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN_CYBERBULLYING/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN_CYBERBULLYING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_cyberbullying_en_2.7.1_2.4_1610188083627.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_cyberbullying_en_2.7.1_2.4_1610188083627.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
use = UniversalSentenceEncoder.pretrained('tfhub_use', lang="en") \
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
document_classifier = ClassifierDLModel.pretrained('classifierdl_use_cyberbullying', 'en') \
.setInputCols(["document", "sentence_embeddings"]) \
.setOutputCol("class")
nlpPipeline = Pipeline(stages=[document_assembler, use, document_classifier])
light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate('@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked')
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val use = UniversalSentenceEncoder.pretrained(lang="en")
.setInputCols(Array("document"))
.setOutputCol("sentence_embeddings")
val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_cyberbullying", "en")
.setInputCols(Array("document", "sentence_embeddings"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier))
val data = Seq("@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked"""]
cyberbull_df = nlu.load('classify.cyberbullying.use').predict(text, output_level='document')
cyberbull_df[["document", "cyberbullying"]]
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.cyberbullying").predict("""@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked""")
```
## Results
```bash
+--------------------------------------------------------------------------------------------------------+------------+
|document |class |
+--------------------------------------------------------------------------------------------------------+------------+
|@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked. | racism |
+--------------------------------------------------------------------------------------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|classifierdl_use_cyberbullying|
|Compatibility:|Spark NLP 2.7.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Dependencies:|tfhub_use|
## Data Source
This model is trained on a cyberbullying detection dataset: https://raw.githubusercontent.com/dhavalpotdar/cyberbullying-detection/master/data/data/data.csv
## Benchmarking
```bash
       label  precision  recall  f1-score  support
     neutral       0.72    0.76      0.74      700
      racism       0.89    0.94      0.92      773
      sexism       0.82    0.71      0.76      622
    accuracy                         0.81     2095
   macro avg       0.81    0.80      0.80     2095
weighted avg       0.81    0.81      0.81     2095
```
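The macro and weighted averages in the benchmark follow from the per-class scores and supports. A quick sketch of that arithmetic using the rounded F1 values from the table (recomputing from rounded values can differ in the last digit from the reported figures, which were derived from unrounded scores):

```python
# Per-class F1 and support, copied from the benchmark table above.
f1 = {"neutral": 0.74, "racism": 0.92, "sexism": 0.76}
support = {"neutral": 700, "racism": 773, "sexism": 622}

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: mean weighted by each class's support.
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total
```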
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from haesun)
author: John Snow Labs
name: xlmroberta_ner_haesun_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `haesun`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_haesun_base_finetuned_panx_de_4.1.0_3.0_1660433456248.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_haesun_base_finetuned_panx_de_4.1.0_3.0_1660433456248.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_haesun_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_haesun_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_haesun_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/haesun/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: English BertForQuestionAnswering Cased model (from aozorahime)
author: John Snow Labs
name: bert_qa_my_new_model
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `my-new-model` is an English model originally trained by `aozorahime`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_my_new_model_en_4.0.0_3.0_1657190466916.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_my_new_model_en_4.0.0_3.0_1657190466916.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_my_new_model","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_my_new_model","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
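Extractive QA models of this kind score each context token as a candidate answer start and end, and the predicted answer is the best-scoring valid span. A toy sketch of that selection step — the token scores below are invented for illustration, not outputs of this model:

```python
# Toy start/end scores over context tokens (illustrative values only).
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
end_scores   = [0.1, 0.1, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]

def best_span(start_scores, end_scores, max_len=10):
    # Pick the (start, end) pair with the highest combined score,
    # subject to start <= end and a maximum span length.
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best[0], best[1]

i, j = best_span(start_scores, end_scores)
answer = " ".join(tokens[i:j + 1])  # "Clara"
```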
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_my_new_model|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/aozorahime/my-new-model
---
layout: model
title: Pipeline to Detect diseases in medical text (biobert)
author: John Snow Labs
name: ner_diseases_biobert_pipeline
date: 2023-03-20
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_diseases_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_diseases_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diseases_biobert_pipeline_en_4.3.0_3.2_1679315318481.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diseases_biobert_pipeline_en_4.3.0_3.2_1679315318481.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_diseases_biobert_pipeline", "en", "clinical/models")
text = '''Indomethacin resulted in histopathologic findings typical of interstitial cystitis, such as leaky bladder epithelium and mucosal mastocytosis. The true incidence of nonsteroidal anti-inflammatory drug-induced cystitis in humans must be clarified by prospective clinical trials. An open-label phase II study of low-dose thalidomide in androgen-independent prostate cancer.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_diseases_biobert_pipeline", "en", "clinical/models")
val text = "Indomethacin resulted in histopathologic findings typical of interstitial cystitis, such as leaky bladder epithelium and mucosal mastocytosis. The true incidence of nonsteroidal anti-inflammatory drug-induced cystitis in humans must be clarified by prospective clinical trials. An open-label phase II study of low-dose thalidomide in androgen-independent prostate cancer."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.diseases_biobert.pipeline").predict("""Indomethacin resulted in histopathologic findings typical of interstitial cystitis, such as leaky bladder epithelium and mucosal mastocytosis. The true incidence of nonsteroidal anti-inflammatory drug-induced cystitis in humans must be clarified by prospective clinical trials. An open-label phase II study of low-dose thalidomide in androgen-independent prostate cancer.""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:----------------------|--------:|------:|:------------|-------------:|
| 0 | interstitial cystitis | 61 | 81 | Disease | 0.99655 |
| 1 | mastocytosis | 129 | 140 | Disease | 0.8569 |
| 2 | cystitis | 209 | 216 | Disease | 0.9717 |
| 3 | prostate cancer | 355 | 369 | Disease | 0.85965 |
```
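The `begin`/`end` values above are character offsets into the input text with `end` inclusive, so each chunk equals `text[begin:end + 1]`; this can be verified directly against the example sentence:

```python
# The begin/end offsets in the pipeline output are character indices into
# the input text, with end inclusive: chunk == text[begin:end + 1].
text = (
    "Indomethacin resulted in histopathologic findings typical of "
    "interstitial cystitis, such as leaky bladder epithelium and mucosal "
    "mastocytosis. The true incidence of nonsteroidal anti-inflammatory "
    "drug-induced cystitis in humans must be clarified by prospective "
    "clinical trials. An open-label phase II study of low-dose thalidomide "
    "in androgen-independent prostate cancer."
)

# (chunk, begin, end) triples copied from the results table above.
rows = [
    ("interstitial cystitis", 61, 81),
    ("mastocytosis", 129, 140),
    ("cystitis", 209, 216),
    ("prostate cancer", 355, 369),
]
for chunk, begin, end in rows:
    assert text[begin:end + 1] == chunk
```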
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_diseases_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.2 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Chinese BertForMaskedLM Cased model (from hfl)
author: John Snow Labs
name: bert_embeddings_rbt3
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rbt3` is a Chinese model originally trained by `hfl`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt3_zh_4.2.4_3.0_1670327065192.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt3_zh_4.2.4_3.0_1670327065192.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt3","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt3","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_rbt3|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|144.5 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/hfl/rbt3
- https://arxiv.org/abs/1906.08101
- https://github.com/google-research/bert
- https://github.com/ymcui/Chinese-BERT-wwm
- https://github.com/ymcui/MacBERT
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/ymcui/HFL-Anthology
- https://arxiv.org/abs/2004.13922
- https://arxiv.org/abs/1906.08101
---
layout: model
title: English asr_wav2vec2_base_test TFWav2Vec2ForCTC from cahya
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_test
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_test` is an English model originally trained by cahya.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_test_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_test_en_4.2.0_3.0_1664035679567.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_test_en_4.2.0_3.0_1664035679567.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_test', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_test", lang = "en")
val annotations = pipeline.transform(audioDF)
```
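`audioDF` above is assumed to already hold raw audio samples as floats. A stdlib-only sketch of extracting normalized float samples from a 16-bit PCM mono WAV file (only the sample extraction is shown; the Spark wiring is an assumption to verify against your Spark NLP version):

```python
import os
import struct
import tempfile
import wave

def wav_to_floats(path):
    # Read a 16-bit PCM mono WAV file and normalize samples to [-1.0, 1.0].
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Round-trip demo: write three known 16-bit samples, then read them back.
path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<3h", 0, 16384, -32768))
floats = wav_to_floats(path)  # [0.0, 0.5, -1.0]
```

A DataFrame for the pipeline could then be built with something like `spark.createDataFrame([[wav_to_floats(path)]]).toDF("audio_content")`, the column name used by the AudioAssembler examples on these pages.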
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_test|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|348.7 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
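Wav2Vec2ForCTC emits one label per audio frame, and CTC decoding collapses these by merging consecutive repeats and then dropping the blank symbol. A toy sketch of greedy CTC collapse (the frame labels and blank token below are illustrative, not the model's actual vocabulary):

```python
BLANK = "_"  # illustrative blank symbol

def ctc_collapse(frame_labels):
    # Merge consecutive repeats, then drop blank symbols.
    out = []
    prev = None
    for label in frame_labels:
        if label != prev:
            out.append(label)
        prev = label
    return "".join(l for l in out if l != BLANK)

decoded = ctc_collapse(list("hh_e_ll_lo"))  # "hello"
```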
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from jgammack)
author: John Snow Labs
name: distilbert_qa_mtl_base_uncased_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `MTL-distilbert-base-uncased-squad` is an English model originally trained by `jgammack`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_mtl_base_uncased_squad_en_4.3.0_3.0_1672765509575.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_mtl_base_uncased_squad_en_4.3.0_3.0_1672765509575.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mtl_base_uncased_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mtl_base_uncased_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_mtl_base_uncased_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/jgammack/MTL-distilbert-base-uncased-squad
---
layout: model
title: Russian RoBERTa Embeddings
author: John Snow Labs
name: roberta_embeddings_ruRoberta_large
date: 2022-04-14
tags: [roberta, embeddings, ru, open_source]
task: Embeddings
language: ru
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `ruRoberta-large` is a Russian model originally trained by `sberbank-ai`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_ruRoberta_large_ru_3.4.2_3.0_1649947722752.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_ruRoberta_large_ru_3.4.2_3.0_1649947722752.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_ruRoberta_large","ru") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Я люблю искра NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_ruRoberta_large","ru")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Я люблю искра NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ru.embed.ruRoberta_large").predict("""Я люблю искра NLP""")
```
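Embeddings such as these are usually consumed downstream via vector similarity. A minimal cosine-similarity sketch on toy vectors (not actual model outputs, which are high-dimensional):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # ≈ 1.0 (parallel vectors)
```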
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_ruRoberta_large|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ru|
|Size:|1.3 GB|
|Case sensitive:|true|
## References
- https://huggingface.co/sberbank-ai/ruRoberta-large
- https://sberdevices.ru/
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from leabum)
author: John Snow Labs
name: distilbert_qa_base_uncased_finetuned_cuad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-cuad` is an English model originally trained by `leabum`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_cuad_en_4.3.0_3.0_1672767856883.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_cuad_en_4.3.0_3.0_1672767856883.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_cuad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_cuad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_finetuned_cuad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/leabum/distilbert-base-uncased-finetuned-cuad
---
layout: model
title: Emotional Stress Classifier (BERT)
author: John Snow Labs
name: bert_sequence_classifier_stress
date: 2022-06-28
tags: [sequence_classification, bert, en, licensed, stress, mental, public_health]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a [PHS-BERT-based](https://huggingface.co/publichealthsurveillance/PHS-BERT) classifier that can classify whether the content of a text expresses emotional stress.
## Predicted Entities
`no stress`, `stress`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_STRESS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_stress_en_4.0.0_3.0_1656438010655.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_stress_en_4.0.0_3.0_1656438010655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_stress", "en", "clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
data = spark.createDataFrame([["No place in my city has shelter space for us, and I won't put my baby on the literal street. What cities have good shelter programs for homeless mothers and children?"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_stress", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier))
val data = Seq("No place in my city has shelter space for us, and I won't put my baby on the literal street. What cities have good shelter programs for homeless mothers and children?").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.stress").predict("""No place in my city has shelter space for us, and I won't put my baby on the literal street. What cities have good shelter programs for homeless mothers and children?""")
```
## Results
```bash
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|text | class|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|No place in my city has shelter space for us, and I won't put my baby on the literal street. What cities have good shelter programs for homeless mothers and children?|[stress]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_stress|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
[Dreaddit dataset](https://arxiv.org/abs/1911.00133)
## Benchmarking
```bash
label precision recall f1-score support
no-stress 0.83 0.82 0.83 334
stress 0.85 0.85 0.85 377
accuracy - - 0.84 711
macro-avg 0.84 0.84 0.84 711
weighted-avg 0.84 0.84 0.84 711
```
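The macro and weighted averages above follow directly from the per-class F1 scores and supports; a quick plain-Python sketch (values taken from the table) shows the arithmetic:

```python
# Per-class (F1, support) from the benchmarking table above
scores = {"no-stress": (0.83, 334), "stress": (0.85, 377)}

# Macro average: unweighted mean of per-class F1
macro_f1 = sum(f1 for f1, _ in scores.values()) / len(scores)

# Weighted average: per-class F1 weighted by support
total = sum(n for _, n in scores.values())
weighted_f1 = sum(f1 * n for f1, n in scores.values()) / total

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.84 0.84
```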
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from rpv)
author: John Snow Labs
name: distilbert_qa_rpv_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `rpv`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_rpv_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772340836.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_rpv_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772340836.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_rpv_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_rpv_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
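Under the hood, extractive QA heads like DistilBertForQuestionAnswering score every token as a potential answer start or end, then pick the best-scoring valid span. A toy sketch of that span-selection step (the tokens and logits are made up for illustration):

```python
# Toy start/end logits over a 6-token context (hypothetical values)
tokens = ["My", "name", "is", "Clara", "and", "I"]
start_logits = [0.1, 0.2, 0.1, 2.5, 0.1, 0.1]
end_logits   = [0.1, 0.1, 0.2, 2.8, 0.1, 0.1]

# Pick the (start, end) pair with the highest combined score, start <= end
best = max(
    ((s, e) for s in range(len(tokens)) for e in range(s, len(tokens))),
    key=lambda p: start_logits[p[0]] + end_logits[p[1]],
)
answer = " ".join(tokens[best[0]: best[1] + 1])
print(answer)  # Clara
```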
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_rpv_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/rpv/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Japanese asr_wav2vec2_large_xlsr_japanese_hiragana TFWav2Vec2ForCTC from vumichien
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_japanese_hiragana
date: 2022-09-25
tags: [wav2vec2, ja, audio, open_source, asr]
task: Automatic Speech Recognition
language: ja
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_japanese_hiragana` is a Japanese model originally trained by vumichien.
NOTE: This model works only on a CPU. If you need to use it on a GPU device, please use asr_wav2vec2_large_xlsr_japanese_hiragana_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_japanese_hiragana_ja_4.2.0_3.0_1664122415445.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_japanese_hiragana_ja_4.2.0_3.0_1664122415445.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xlsr_japanese_hiragana", "ja")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xlsr_japanese_hiragana", "ja")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
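Wav2Vec2ForCTC emits a per-frame distribution over characters plus a CTC blank token; greedy decoding collapses consecutive repeats and drops the blanks. A minimal sketch of that collapse step (the per-frame labels are invented for illustration):

```python
BLANK = "_"

def ctc_collapse(frames):
    """Greedy CTC decode: merge consecutive duplicates, then drop blanks."""
    out = []
    prev = None
    for ch in frames:
        if ch != prev:          # merge repeated frame labels
            if ch != BLANK:     # drop the CTC blank token
                out.append(ch)
        prev = ch
    return "".join(out)

# Hypothetical per-frame argmax labels for the hiragana string "こんにちは"
frames = ["こ", "こ", "_", "ん", "_", "に", "に", "ち", "_", "は", "は"]
print(ctc_collapse(frames))  # こんにちは
```

Note how the blank between the two runs of the same character is what allows genuine doubled characters to survive the merge.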
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xlsr_japanese_hiragana|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|ja|
|Size:|1.2 GB|
---
layout: model
title: Pipeline to Detect PHI in Text (enriched-biobert)
author: John Snow Labs
name: ner_deid_enriched_biobert_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, deidentification, enriched_biobert, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_deid_enriched_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_deid_enriched_biobert_en.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_biobert_pipeline_en_3.4.1_3.0_1647868393082.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_biobert_pipeline_en_3.4.1_3.0_1647868393082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_deid_enriched_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""")
```
```scala
val pipeline = new PretrainedPipeline("ner_deid_enriched_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.deid.ner_enriched_biobert.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""")
```
## Results
```bash
+-----------------------------+------------+
|chunks |entities |
+-----------------------------+------------+
|2093-01-13 |DATE |
|David Hale |DOCTOR |
|Hendrickson, Ora |DOCTOR |
|7194334 |PHONE |
|01/13/93 |DATE |
|Oliveira |DOCTOR |
|1-11-2000 |DATE |
|Cocke County Baptist Hospital|HOSPITAL |
|0295 Keats Street |STREET |
|(302) 786-5227 |PHONE |
|Brothers Coal-Mine |ORGANIZATION|
+-----------------------------+------------+
```
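A common next step after detection is masking: replacing each detected chunk with its entity label. A plain-Python sketch of that substitution, using a few of the chunks and labels from the table above:

```python
text = ("Record date : 2093-01-13, David Hale, M.D. "
        "Cocke County Baptist Hospital.")
chunks = [("2093-01-13", "DATE"), ("David Hale", "DOCTOR"),
          ("Cocke County Baptist Hospital", "HOSPITAL")]

# Replace each detected chunk with an <ENTITY> placeholder
masked = text
for chunk, label in chunks:
    masked = masked.replace(chunk, f"<{label}>")
print(masked)  # Record date : <DATE>, <DOCTOR>, M.D. <HOSPITAL>.
```

In practice the Healthcare NLP `DeIdentification` annotator performs this step using the chunk begin/end offsets rather than string replacement.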
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_enriched_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.0 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverter
---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_el2
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-el2` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_el2_en_4.3.0_3.0_1675123363126.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_el2_en_4.3.0_3.0_1675123363126.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_tiny_el2","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_tiny_el2","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_el2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|59.2 MB|
## References
- https://huggingface.co/google/t5-efficient-tiny-el2
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_dl8
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-dl8` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dl8_en_4.3.0_3.0_1675118817239.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dl8_en_4.3.0_3.0_1675118817239.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_small_dl8","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_small_dl8","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_dl8|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|164.0 MB|
## References
- https://huggingface.co/google/t5-efficient-small-dl8
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from google)
author: John Snow Labs
name: t5_efficient_base_el2
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-el2` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_el2_en_4.3.0_3.0_1675111126562.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_el2_en_4.3.0_3.0_1675111126562.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_base_el2","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_base_el2","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_base_el2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|339.1 MB|
## References
- https://huggingface.co/google/t5-efficient-base-el2
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Detect PHI for Deidentification (Glove - Subentity)
author: John Snow Labs
name: ner_deid_subentity_glove
date: 2021-06-06
tags: [ner, deid, licensed, en, glove, clinical]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.4
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The Named Entity Recognition annotator allows a generic model to be trained using a deep learning architecture (char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This Deidentification NER model annotates text to find protected health information (PHI) that may need to be de-identified. It detects 23 entities. The model is trained on a combination of the i2b2 train set and an augmented version of the i2b2 train set, using GloVe-100d embeddings.
We adhered to the official annotation guideline (AG) of the 2014 i2b2 Deid challenge while annotating the new datasets for this model. All details regarding the nuances of the AG can be found here: [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/)
## Predicted Entities
`MEDICALRECORD`, `ORGANIZATION`, `DOCTOR`, `USERNAME`, `PROFESSION`, `HEALTHPLAN`, `URL`, `CITY`, `DATE`, `LOCATION-OTHER`, `STATE`, `PATIENT`, `DEVICE`, `COUNTRY`, `ZIP`, `PHONE`, `HOSPITAL`, `EMAIL`, `IDNUM`, `STREET`, `BIOID`, `FAX`, `AGE`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/deid_ner_subentity_glove_en_3.0.4_3.0_1623015533538.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/deid_ner_subentity_glove_en_3.0.4_3.0_1623015533538.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d') \
.setInputCols(['sentence', 'token']) \
.setOutputCol('embeddings')
deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_glove", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk_subentity")
nlpPipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
glove_embeddings,
deid_ner,
ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
text = """A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227."""
results = model.transform(spark.createDataFrame([[text]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val glove_embeddings = WordEmbeddingsModel.pretrained("glove_100d")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_glove", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk_subentity")
val nlpPipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
glove_embeddings,
deid_ner,
ner_converter))
val data = Seq("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227.").toDS.toDF("text")
val result = nlpPipeline.fit(data).transform(data)
```
## Results
```bash
+-----------------------------+-------------+
|chunk |ner_label |
+-----------------------------+-------------+
|2093-01-13 |DATE |
|David Hale |DOCTOR |
|Hendrickson, Ora |PATIENT |
|7194334 |MEDICALRECORD|
|01/13/93 |DATE |
|Oliveira |DOCTOR |
|25 |AGE |
|1-11-2000 |DATE |
|Cocke County Baptist Hospital|HOSPITAL |
|0295 Keats Street |STREET |
|+1 (302) 786-5227 |PHONE |
+-----------------------------+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_subentity_glove|
|Compatibility:|Healthcare NLP 3.0.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Data Source
A custom dataset, created from the i2b2-PHI train set and an augmented version of it, was used.
---
layout: model
title: Translate Yoruba to English Pipeline
author: John Snow Labs
name: translate_yo_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, yo, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `yo`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_yo_en_xx_2.7.0_2.4_1609688368244.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_yo_en_xx_2.7.0_2.4_1609688368244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_yo_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_yo_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.yo.translate_to.en').predict(text, output_level='sentence')
translate_df
```
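Output from Marian translation pipelines is usually evaluated with BLEU, which is built from clipped n-gram precisions. As a toy illustration, a sketch of the unigram-precision component only (not the full BLEU score, which also combines higher-order n-grams and a brevity penalty):

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision, the 1-gram component of BLEU."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    clipped = sum(min(n, ref[w]) for w, n in cand.items())
    return clipped / max(sum(cand.values()), 1)

cand = "the cat sat on the mat"
ref  = "the cat is on the mat"
print(unigram_precision(cand, ref))  # 5/6 ≈ 0.833
```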
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_yo_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Sentence Entity Resolver for UMLS CUI Codes (Disease or Syndrome)
author: John Snow Labs
name: sbiobertresolve_umls_disease_syndrome
date: 2021-10-11
tags: [entity_resolution, licensed, clinical, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.2.3
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities to UMLS CUI codes. It is trained on the `2021AB` UMLS dataset. The complete dataset has 127 different categories, and this model is trained on the `Disease or Syndrome` category using `sbiobert_base_cased_mli` embeddings.
## Predicted Entities
`Predicts UMLS codes for Diseases & Syndromes medical concepts`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_disease_syndrome_en_3.2.3_3.0_1633911418710.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_disease_syndrome_en_3.2.3_3.0_1633911418710.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The ```sbiobertresolve_umls_disease_syndrome``` resolver model must be used with ```sbiobert_base_cased_mli``` as embeddings and ```ner_jsl``` as the NER model. The labels ```Cerebrovascular_Disease, Communicable_Disease, Diabetes, Disease_Syndrome_Disorder, Heart_Disease, Hyperlipidemia, Hypertension, Injury_or_Poisoning, Kidney_Disease, Obesity, Oncological, Overweight, Psychological_Condition, Symptom, VS_Finding, ImagingFindings, EKG_Findings``` should be set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_umls_disease_syndrome","en", "clinical/models") \
.setInputCols(["ner_chunk", "sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver])
data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""]]).toDF("text")
results = pipeline.fit(data).transform(data)
```
```scala
...
val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli", "en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
val resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_umls_disease_syndrome", "en", "clinical/models")
.setInputCols(Array("ner_chunk_doc", "sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val p_model = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver))
val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.").toDF("text")
val res = p_model.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.umls_disease_syndrome").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""")
```
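Conceptually, the resolver embeds each NER chunk and returns the UMLS code whose stored embedding is nearest under the configured distance (`EUCLIDEAN` in the pipelines above). A toy sketch of that nearest-code lookup, with made-up 3-dimensional vectors standing in for the real 768-dimensional sBioBERT embeddings:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical code embeddings (real ones are 768-dim sBioBERT vectors)
code_vectors = {
    "C0011849": [0.9, 0.1, 0.0],   # diabetes mellitus
    "C0028754": [0.1, 0.8, 0.2],   # obesity
}
chunk_vector = [0.85, 0.15, 0.05]  # hypothetical embedding of the chunk "T2DM"

# Resolve to the code with the smallest distance to the chunk embedding
resolved = min(code_vectors, key=lambda c: euclidean(chunk_vector, code_vectors[c]))
print(resolved)  # C0011849
```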
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_rare_puppers", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_rare_puppers", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_rare_puppers|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Translate English to Kinyarwanda Pipeline
author: John Snow Labs
name: translate_en_rw
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, rw, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `rw`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_rw_xx_2.7.0_2.4_1609686452332.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_rw_xx_2.7.0_2.4_1609686452332.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_rw", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_rw", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.rw').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_rw|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Fast Neural Machine Translation Model from South Caucasian Languages to English
author: John Snow Labs
name: opus_mt_ccs_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, ccs, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `ccs`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ccs_en_xx_2.7.0_2.4_1609169090699.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ccs_en_xx_2.7.0_2.4_1609169090699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_ccs_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_ccs_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.ccs.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_ccs_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English Bert Embeddings (from anferico)
author: John Snow Labs
name: bert_embeddings_bert_for_patents
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-for-patents` is an English model originally trained by `anferico`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_for_patents_en_3.4.2_3.0_1649671629607.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_for_patents_en_3.4.2_3.0_1649671629607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_for_patents","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_for_patents","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.bert_for_patents").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_for_patents|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
## References
- https://huggingface.co/anferico/bert-for-patents
- https://cloud.google.com/blog/products/ai-machine-learning/how-ai-improves-patent-analysis
- https://services.google.com/fh/files/blogs/bert_for_patents_white_paper.pdf
- https://github.com/google/patents-public-data/blob/master/models/BERT%20for%20Patents.md
- https://github.com/ec-jrc/Patents4IPPC
- https://picampus-school.com/
- https://ec.europa.eu/jrc/en
---
layout: model
title: Spanish RoBERTa Embeddings (Large)
author: John Snow Labs
name: roberta_embeddings_roberta_large_bne
date: 2022-04-14
tags: [roberta, embeddings, es, open_source]
task: Embeddings
language: es
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-large-bne` is a Spanish model originally trained by `PlanTL-GOB-ES`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_large_bne_es_3.4.2_3.0_1649945069671.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_large_bne_es_3.4.2_3.0_1649945069671.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_large_bne","es") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_large_bne","es")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Me encanta chispa nlp").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.embed.roberta_large_bne").predict("""Me encanta chispa nlp""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_roberta_large_bne|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|es|
|Size:|848.5 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne
- https://arxiv.org/abs/1907.11692
- http://www.bne.es/en/Inicio/index.html
- https://github.com/PlanTL-GOB-ES/lm-spanish
- https://arxiv.org/abs/2107.07253
---
layout: model
title: Liability and Contra-Liability NER (Small)
author: John Snow Labs
name: finner_contraliability
date: 2022-12-15
tags: [en, finance, contra, liability, licensed, ner]
task: Named Entity Recognition
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a financial model to detect LIABILITY and CONTRA_LIABILITY mentions in texts.
- CONTRA_LIABILITY: Negative liability account that offsets the liability account (e.g. paying a debt)
- LIABILITY: Current or Long-Term Liability (not from stockholders)
Please note this model requires some tokenization configuration to extract the currency (see python snippet below).
## Predicted Entities
`LIABILITY`, `CONTRA_LIABILITY`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_contraliability_en_1.0.0_3.0_1671136444267.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_contraliability_en_1.0.0_3.0_1671136444267.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")\
.setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€'])
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)
ner_model = finance.NerModel.pretrained("finner_contraliability", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter
])
data = spark.createDataFrame([["""Reducing total debt continues to be a top priority , and we remain on track with our target of reducing overall debt levels by $ 15 billion by the end of 2025 ."""]]).toDF("text")
model = pipeline.fit(data)
result = model.transform(data)
from pyspark.sql import functions as F
result.select(F.explode(F.arrays_zip(result.token.result, result.ner.result, result.ner.metadata)).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
F.expr("cols['1']").alias("ner_label"),
F.expr("cols['2']['confidence']").alias("confidence")).show(50, truncate = False)
```
## Results
```bash
+---------+------------------+----------+
| token| ner_label|confidence|
+---------+------------------+----------+
| Reducing| O| 0.9997|
| total| B-LIABILITY| 0.7884|
| debt| I-LIABILITY| 0.8479|
|continues| O| 1.0|
| to| O| 1.0|
| be| O| 1.0|
| a| O| 1.0|
| top| O| 1.0|
| priority| O| 1.0|
| ,| O| 1.0|
| and| O| 1.0|
| we| O| 1.0|
| remain| O| 1.0|
| on| O| 1.0|
| track| O| 1.0|
| with| O| 1.0|
| our| O| 1.0|
| target| O| 1.0|
| of| O| 1.0|
| reducing| O| 0.9993|
| overall| O| 0.9969|
| debt|B-CONTRA_LIABILITY| 0.5686|
| levels|I-CONTRA_LIABILITY| 0.6611|
| by| O| 0.9996|
| $| O| 1.0|
| 15| O| 1.0|
| billion| O| 1.0|
| by| O| 1.0|
| the| O| 1.0|
| end| O| 1.0|
| of| O| 1.0|
| 2025| O| 1.0|
| .| O| 1.0|
+---------+------------------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finner_contraliability|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|16.2 MB|
## References
In-house annotations on Earnings Calls and 10-K filings combined.
## Benchmarking
```bash
label precision recall f1-score support
B-CONTRA_LIABILITY 0.7660 0.7200 0.7423 50
B-LIABILITY 0.8947 0.8990 0.8969 208
I-CONTRA_LIABILITY 0.7838 0.6304 0.6988 46
I-LIABILITY 0.8780 0.8929 0.8854 411
accuracy - - 0.9805 8299
macro-avg 0.8626 0.8267 0.8429 8299
weighted-avg 0.9803 0.9805 0.9803 8299
```
---
layout: model
title: Recognize Entities DL Pipeline for Norwegian (Bokmal) - Medium
author: John Snow Labs
name: entity_recognizer_md
date: 2021-03-22
tags: [open_source, norwegian_bokmal, entity_recognizer_md, pipeline, "no"]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: "no"
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The entity_recognizer_md is a pretrained pipeline that can be used to process text out of the box. It performs the most common text processing tasks on your dataframe: sentence detection, tokenization, embeddings lookup, and named entity recognition.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_no_3.0.0_3.0_1616451623734.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_no_3.0.0_3.0_1616451623734.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('entity_recognizer_md', lang = 'no')
annotations = pipeline.fullAnnotate("Hei fra John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("entity_recognizer_md", lang = "no")
val result = pipeline.fullAnnotate("Hei fra John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hei fra John Snow Labs! "]
result_df = nlu.load('no.ner.md').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | embeddings | ner | entities |
|---:|:-----------------------------|:----------------------------|:----------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
| 0 | ['Hei fra John Snow Labs! '] | ['Hei fra John Snow Labs!'] | ['Hei', 'fra', 'John', 'Snow', 'Labs!'] | [[0.1868100017309188,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|entity_recognizer_md|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|no|
---
layout: model
title: HCP Consult Classifier (BioBERT)
author: John Snow Labs
name: bert_sequence_classifier_vop_hcp_consult
date: 2023-06-13
tags: [licensed, en, classification, vop, clinical, tensorflow]
task: Text Classification
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier that can identify texts that mention an HCP (healthcare professional) consult.
## Predicted Entities
`Consulted_By_HCP`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_hcp_consult_en_4.4.3_3.0_1686679279680.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_hcp_consult_en_4.4.3_3.0_1686679279680.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vop_hcp_consult", "en", "clinical/models")\
.setInputCols(["document",'token'])\
.setOutputCol("prediction")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
data = spark.createDataFrame(["hi does anybody have feet aches with anxiety, i do suffer from anxiety but never had anything wrong with my feet before",
"My son has been to two doctors who gave him antibiotic drops but they also say the problem might related to allergies."], StringType()).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vop_hcp_consult", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("prediction")
val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier))
val data = Seq("hi does anybody have feet aches with anxiety, i do suffer from anxiety but never had anything wrong with my feet before",
"My son has been to two doctors who gave him antibiotic drops but they also say the problem might related to allergies.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+-----------------------------------------------------------------------------------------------------------------------+------------------+
|text |result |
+-----------------------------------------------------------------------------------------------------------------------+------------------+
|hi does anybody have feet aches with anxiety, i do suffer from anxiety but never had anything wrong with my feet before|[Other] |
|My son has been to two doctors who gave him antibiotic drops but they also say the problem might related to allergies. |[Consulted_By_HCP]|
+-----------------------------------------------------------------------------------------------------------------------+------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_vop_hcp_consult|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
In-house annotated health-related text in colloquial language.
## Sample text from the training dataset
“Hello,I’m 20 year old girl. I’m diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I’m taking weekly supplement of vitamin D and 1000 mcg b12 daily. I’m taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I’m facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.”
## Benchmarking
```bash
label precision recall f1-score support
Consulted_By_HCP 0.670412 0.730612 0.699219 245
Other 0.848624 0.807860 0.827740 458
accuracy - - 0.780939 703
macro_avg 0.759518 0.769236 0.763480 703
weighted_avg 0.786516 0.780939 0.782950 703
```
---
layout: model
title: Fast Neural Machine Translation Model from English to Ewe
author: John Snow Labs
name: opus_mt_en_ee
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, ee, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `ee`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ee_xx_2.7.0_2.4_1609167230649.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ee_xx_2.7.0_2.4_1609167230649.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_ee", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_ee", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ee').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ee|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Termination Clause Binary Classifier (md)
author: John Snow Labs
name: legclf_termination_md
date: 2022-11-25
tags: [en, legal, classification, document, agreement, contract, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `termination` clause type. To use it, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only individual sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
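The paragraph splitting by multiline breaks mentioned above can be sketched in plain Python (the `split_paragraphs` helper below is shown for illustration only; it is not part of Spark NLP):

```python
import re

def split_paragraphs(text):
    # Split a document into paragraph candidates on blank lines,
    # dropping empty fragments and trimming surrounding whitespace.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "12. TERMINATION.\nEither party may terminate this Agreement.\n\n13. GOVERNING LAW."
print(split_paragraphs(doc))
```

Each returned paragraph can then be fed to the classifier as a separate row of the input dataframe.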
Take into consideration that the embeddings used by this model allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values, one for each clause model you add.
## Predicted Entities
`other`, `termination`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_termination_md_en_1.0.0_3.0_1669376522291.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_termination_md_en_1.0.0_3.0_1669376522291.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
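A minimal usage sketch, following the pattern of other Legal NLP clause classifiers; the sentence-embeddings model name (`sent_bert_base_cased`) is an assumption and may differ from the embeddings this classifier was trained with:

```python
# Hypothetical usage sketch: the embeddings model name is an assumption.
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
.setInputCols("document")\
.setOutputCol("sentence_embeddings")
doc_classifier = legal.ClassifierDLModel.pretrained("legclf_termination_md", "en", "legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")
pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])
df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```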
## Results
```bash
+-------------+
|       result|
+-------------+
|[termination]|
|      [other]|
|      [other]|
|[termination]|
+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_termination_md|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scraped from the Internet and classified in-house, plus SEC documents.
## Benchmarking
```bash
              precision    recall  f1-score   support
       other       0.95      0.97      0.96        39
 termination       0.97      0.94      0.95        32
    accuracy                           0.96        71
   macro avg       0.96      0.96      0.96        71
weighted avg       0.96      0.96      0.96        71
```
---
layout: model
title: Chinese Bert Embeddings (from celtics1863)
author: John Snow Labs
name: bert_embeddings_env_bert_chinese
date: 2022-04-11
tags: [bert, embeddings, zh, open_source]
task: Embeddings
language: zh
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `env-bert-chinese` is a Chinese model originally trained by `celtics1863`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_env_bert_chinese_zh_3.4.2_3.0_1649670664238.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_env_bert_chinese_zh_3.4.2_3.0_1649670664238.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_env_bert_chinese","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_env_bert_chinese","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("zh.embed.env_bert_chinese").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_env_bert_chinese|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|384.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/celtics1863/env-bert-chinese
---
layout: model
title: NER Pipeline for 10 African Languages
author: John Snow Labs
name: xlm_roberta_large_token_classifier_masakhaner_pipeline
date: 2022-06-27
tags: [masakhaner, african, xlm_roberta, multilingual, pipeline, amharic, hausa, igbo, kinyarwanda, luganda, swahilu, wolof, yoruba, nigerian, pidgin, xx, open_source]
task: Named Entity Recognition
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the [xlm_roberta_large_token_classifier_masakhaner](https://nlp.johnsnowlabs.com/2021/12/06/xlm_roberta_large_token_classifier_masakhaner_xx.html) NER model, which is imported from `HuggingFace`.
## Predicted Entities
`DATE`, `LOC`, `ORG`, `PER`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/Ner_masakhaner/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/Ner_masakhaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_masakhaner_pipeline_xx_4.0.0_3.0_1656369154380.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_masakhaner_pipeline_xx_4.0.0_3.0_1656369154380.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
masakhaner_pipeline = PretrainedPipeline("xlm_roberta_large_token_classifier_masakhaner_pipeline", lang = "xx")
masakhaner_pipeline.annotate("አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።")
```
```scala
val masakhaner_pipeline = new PretrainedPipeline("xlm_roberta_large_token_classifier_masakhaner_pipeline", lang = "xx")
masakhaner_pipeline.annotate("አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።")
```
## Results
```bash
+----------------+---------+
|chunk |ner_label|
+----------------+---------+
|አህመድ ቫንዳ |PER |
|ከ3-10-2000 ጀምሮ|DATE |
|በአዲስ አበባ |LOC |
+----------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_large_token_classifier_masakhaner_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|xx|
|Size:|1.8 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- XlmRoBertaForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: English DistilBertForQuestionAnswering model (from hiiii23)
author: John Snow Labs
name: distilbert_qa_hiiii23_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `hiiii23`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hiiii23_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725446833.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hiiii23_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725446833.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hiiii23_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hiiii23_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_hiiii23").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_hiiii23_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/hiiii23/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_4_h_512
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-4_H-512` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_512_zh_4.2.4_3.0_1670325937593.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_512_zh_4.2.4_3.0_1670325937593.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_512","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_512","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_4_h_512|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|90.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-4_H-512
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://github.com/dbiir/UER-py/
- https://cloud.tencent.com/
---
layout: model
title: Spanish Named Entity Recognition (Base, CAPITEL competition at IberLEF 2020 dataset)
author: John Snow Labs
name: roberta_ner_roberta_base_bne_capitel_ner
date: 2022-05-03
tags: [roberta, ner, open_source, es]
task: Named Entity Recognition
language: es
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-base-bne-capitel-ner` is a Spanish model originally trained by `PlanTL-GOB-ES`.
## Predicted Entities
`ORG`, `LOC`, `PER`, `OTH`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_base_bne_capitel_ner_es_3.4.2_3.0_1651593219771.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_base_bne_capitel_ner_es_3.4.2_3.0_1651593219771.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_base_bne_capitel_ner","es") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Amo Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_base_bne_capitel_ner","es")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Amo Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.ner.roberta_base_bne_capitel_ner").predict("""Amo Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_ner_roberta_base_bne_capitel_ner|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|457.2 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-capitel-ner
- https://arxiv.org/abs/1907.11692
- http://www.bne.es/en/Inicio/index.html
- https://sites.google.com/view/capitel2020
- https://github.com/PlanTL-GOB-ES/lm-spanish
- https://arxiv.org/abs/2107.07253
---
layout: model
title: Detect PHI for Deidentification purposes (Portuguese)
author: John Snow Labs
name: ner_deid_subentity
date: 2022-04-13
tags: [deid, deidentification, pt, licensed]
task: De-identification
language: pt
edition: Healthcare NLP 3.4.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Deidentification NER (Portuguese) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 15 entities.
This NER model is trained with a combination of custom datasets with several data augmentation mechanisms.
## Predicted Entities
`PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `ID`, `STREET`, `SEX`, `EMAIL`, `ZIP`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pt_3.4.2_3.0_1649840643338.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pt_3.4.2_3.0_1649840643338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("w2v_cc_300d", "pt")\
.setInputCols(["sentence","token"])\
.setOutputCol("word_embeddings")
clinical_ner = medical.NerModel.pretrained("ner_deid_subentity", "pt", "clinical/models")\
.setInputCols(["sentence","token","word_embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner,
ner_converter])
text = ['''
Detalhes do paciente.
Nome do paciente: Pedro Gonçalves
NHC: 2569870.
Endereço: Rua Das Flores 23.
Cidade/ Província: Porto.
Código Postal: 21754-987.
Dados de cuidados.
Data de nascimento: 10/10/1963.
Idade: 53 anos Sexo: Homen
Data de admissão: 17/06/2016.
Doutora: Maria Santos
''']
data = spark.createDataFrame([text]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "pt")
.setInputCols(Array("sentence","token"))
.setOutputCol("word_embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "pt", "clinical/models")
.setInputCols(Array("sentence","token","word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner,
ner_converter))
val text = """Detalhes do paciente.
Nome do paciente: Pedro Gonçalves
NHC: 2569870.
Endereço: Rua Das Flores 23.
Cidade/ Província: Porto.
Código Postal: 21754-987.
Dados de cuidados.
Data de nascimento: 10/10/1963.
Idade: 53 anos Sexo: Homen
Data de admissão: 17/06/2016.
Doutora: Maria Santos"""
val data = Seq(text).toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("pt.med_ner.deid.subentity").predict("""
Detalhes do paciente.
Nome do paciente: Pedro Gonçalves
NHC: 2569870.
Endereço: Rua Das Flores 23.
Cidade/ Província: Porto.
Código Postal: 21754-987.
Dados de cuidados.
Data de nascimento: 10/10/1963.
Idade: 53 anos Sexo: Homen
Data de admissão: 17/06/2016.
Doutora: Maria Santos
""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ka") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["მე მიყვარს ნაპერწკალი NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ka")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("მე მიყვარს ნაპერწკალი NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ka.embed.w2v_cc_300d").predict("""მე მიყვარს ნაპერწკალი NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|ka|
|Size:|909.1 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Table recognition
author: John Snow Labs
name: table_recognition
date: 2023-01-03
tags: [en, licensed, ocr, table_recognition]
task: Table Recognition
language: en
nav_key: models
edition: Visual NLP 4.1.0
spark_version: 3.3.0
supported: true
annotator: TableRecognition
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model shows the capabilities for table recognition and free-text extraction using OCR techniques.
For table recognition, a CascadeTabNet model is used.
CascadeTabNet is a machine learning model for table detection in document images. It is based on a cascaded architecture, which is a two-stage process where the model first detects candidate regions that may contain tables, and then classifies these regions as tables or non-tables. The model is trained using a dataset of document images, where the tables have been manually annotated.
The benchmark results show that the model is able to detect tables in document images with high accuracy.
On the ICDAR2013 table competition dataset, CascadeTabNet achieved an F1-score of 0.85, which is considered a good score in this dataset. On the COCO-Text dataset, the model achieved a precision of 0.82 and a recall of 0.79, which are also considered good scores. In addition, the model has been evaluated on the UNLV dataset, where it achieved a precision of 0.87 and a recall of 0.83.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/ocr/IMAGE_TABLE_DETECTION/){:.button.button-orange.button-orange-trans.co.button-icon}
[Open in Colab](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/tutorials/Certification_Trainings/2.2.Spark_OCR_training_Table_recognition.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
binary_to_image = BinaryToImage() \
.setInputCol("content") \
.setOutputCol("image") \
.setImageType(ImageType.TYPE_3BYTE_BGR)
# Detect tables on the page using a pretrained model.
# It can be fine-tuned to achieve more accurate results on specific document types.
table_detector = ImageTableDetector.pretrained("general_model_table_detection_v2", "en", "clinical/ocr") \
.setInputCol("image") \
.setOutputCol("region")
# Draw the detected table regions on the page
draw_regions = ImageDrawRegions() \
.setInputCol("image") \
.setInputRegionsCol("region") \
.setOutputCol("image_with_regions") \
.setRectColor(Color.red)
# Extract table regions to separate images
splitter = ImageSplitRegions() \
.setInputCol("image") \
.setInputRegionsCol("region") \
.setOutputCol("table_image") \
.setDropCols("image")
# Detect cells on the table image
cell_detector = ImageTableCellDetector() \
.setInputCol("table_image") \
.setOutputCol("cells") \
.setAlgoType("morphops") \
.setDrawDetectedLines(True)
# Extract text from the detected cells
table_recognition = ImageCellsToTextTable() \
.setInputCol("table_image") \
.setCellsCol('cells') \
.setMargin(3) \
.setStrip(True) \
.setOutputCol('table')
# Erase detected table regions
fill_regions = ImageDrawRegions() \
.setInputCol("image") \
.setInputRegionsCol("region") \
.setOutputCol("image_1") \
.setRectColor(Color.white) \
.setFilledRect(True)
# OCR
ocr = ImageToText() \
.setInputCol("image_1") \
.setOutputCol("text") \
.setOcrParams(["preserve_interword_spaces=1", ]) \
.setKeepLayout(True) \
.setOutputSpaceCharacterWidth(8)
pipeline_table = PipelineModel(stages=[
binary_to_image,
table_detector,
draw_regions,
fill_regions,
splitter,
cell_detector,
table_recognition,
ocr
])
imagePath = "/content/cTDaR_t10096.jpg"
df = spark.read.format("binaryFile").load(imagePath)
tables_results = pipeline_table.transform(df).cache()
```
```scala
val binary_to_image = new BinaryToImage()
.setInputCol("content")
.setOutputCol("image")
.setImageType(ImageType.TYPE_3BYTE_BGR)
// Detect tables on the page using a pretrained model.
// It can be fine-tuned to achieve more accurate results on specific document types.
val table_detector = ImageTableDetector
.pretrained("general_model_table_detection_v2", "en", "clinical/ocr")
.setInputCol("image")
.setOutputCol("region")
// Draw the detected table regions on the page
val draw_regions = new ImageDrawRegions()
.setInputCol("image")
.setInputRegionsCol("region")
.setOutputCol("image_with_regions")
.setRectColor(Color.red)
// Extract table regions to separate images
val splitter = new ImageSplitRegions()
.setInputCol("image")
.setInputRegionsCol("region")
.setOutputCol("table_image")
.setDropCols("image")
// Detect cells on the table image
val cell_detector = new ImageTableCellDetector()
.setInputCol("table_image")
.setOutputCol("cells")
.setAlgoType("morphops")
.setDrawDetectedLines(true)
// Extract text from the detected cells
val table_recognition = new ImageCellsToTextTable()
.setInputCol("table_image")
.setCellsCol("cells")
.setMargin(3)
.setStrip(true)
.setOutputCol("table")
// Erase detected table regions
val fill_regions = new ImageDrawRegions()
.setInputCol("image")
.setInputRegionsCol("region")
.setOutputCol("image_1")
.setRectColor(Color.white)
.setFilledRect(true)
// OCR
val ocr = new ImageToText()
.setInputCol("image_1")
.setOutputCol("text")
.setOcrParams(Array("preserve_interword_spaces=1"))
.setKeepLayout(true)
.setOutputSpaceCharacterWidth(8)
val pipeline_table = new Pipeline().setStages(Array(
binary_to_image,
table_detector,
draw_regions,
fill_regions,
splitter,
cell_detector,
table_recognition,
ocr))
val imagePath = "/content/cTDaR_t10096.jpg"
val df = spark.read.format("binaryFile").load(imagePath)
val tables_results = pipeline_table.fit(df).transform(df).cache()
```
## Example
### Input Image

### Table Structure Recognition
{%- capture td_image -%}

{%- endcapture -%}
{%- capture tsr_detection -%}

{%- endcapture -%}
{% include templates/input_output_image.md
input_image=td_image
output_image=tsr_detection
%}
## Model Information
{:.table-model}
|---|---|
|Model Name:|table_recognition|
|Compatibility:|Healthcare NLP 4.1.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from huxxx657)
author: John Snow Labs
name: roberta_qa_base_finetuned_scrambled_squad_10
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-10` is an English model originally trained by `huxxx657`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_10_en_4.3.0_3.0_1674216712953.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_10_en_4.3.0_3.0_1674216712953.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_10","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_10","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_finetuned_scrambled_squad_10|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-10
---
layout: model
title: Legal Natural And Applied Sciences Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_natural_and_applied_sciences_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, natural_and_applied_sciences, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the `legclf_natural_and_applied_sciences_bert` model, a Bert Sentence Embeddings Document Classifier, determines whether the document belongs to the `Natural_and_Applied_Sciences` class or not (binary classification), according to EuroVoc labels.
## Predicted Entities
`Natural_and_Applied_Sciences`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_natural_and_applied_sciences_bert_en_1.0.0_3.0_1678111577098.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_natural_and_applied_sciences_bert_en_1.0.0_3.0_1678111577098.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------+
|result|
+-------+
|[Natural_and_Applied_Sciences]|
|[Other]|
|[Other]|
|[Natural_and_Applied_Sciences]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_natural_and_applied_sciences_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.7 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Natural_and_Applied_Sciences 0.95 0.90 0.92 99
Other 0.90 0.95 0.92 91
accuracy - - 0.92 190
macro-avg 0.92 0.92 0.92 190
weighted-avg 0.92 0.92 0.92 190
```
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from Li)
author: John Snow Labs
name: roberta_qa_li_base_squad2
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2` is an English model originally trained by `Li`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_li_base_squad2_en_4.3.0_3.0_1674218901284.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_li_base_squad2_en_4.3.0_3.0_1674218901284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_li_base_squad2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_li_base_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_li_base_squad2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|462.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Li/roberta-base-squad2
- https://rajpurkar.github.io/SQuAD-explorer/
---
layout: model
title: Explain Document pipeline for Danish (explain_document_lg)
author: John Snow Labs
name: explain_document_lg
date: 2021-03-23
tags: [open_source, danish, explain_document_lg, pipeline, da]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: da
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The explain_document_lg is a pretrained pipeline that performs the basic text processing steps and recognizes entities.
It covers most of the common text processing tasks on your dataframe.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_da_3.0.0_3.0_1616524893608.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_lg_da_3.0.0_3.0_1616524893608.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('explain_document_lg', lang = 'da')
annotations = pipeline.fullAnnotate("Hej fra John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_lg", lang = "da")
val result = pipeline.fullAnnotate("Hej fra John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hej fra John Snow Labs! "]
result_df = nlu.load('da.explain.lg').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | lemma | pos | embeddings | ner | entities |
|---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
| 0 | ['Hej fra John Snow Labs! '] | ['Hej fra John Snow Labs!'] | ['Hej', 'fra', 'John', 'Snow', 'Labs!'] | ['Hej', 'fra', 'John', 'Snow', 'Labs!'] | ['NOUN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[-0.025171000510454,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```
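The `entities` column above is obtained by merging the token-level IOB tags in `ner` into chunks. A simplified sketch of that merge (an illustration of the scheme, not the pipeline's actual converter code):

```python
# Merge IOB tags into entity chunks: "B-" starts a chunk, "I-" extends it,
# "O" closes any open chunk.
def iob_to_chunks(tokens, tags):
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ['Hej', 'fra', 'John', 'Snow', 'Labs!']
tags = ['O', 'O', 'B-PER', 'I-PER', 'I-PER']
print(iob_to_chunks(tokens, tags))  # ['John Snow Labs!']
```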
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_document_lg|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|da|
---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_6
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_6_en_4.0.0_3.0_1655731850822.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_6_en_4.0.0_3.0_1655731850822.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_6","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_6","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_seed_6").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
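The nlu one-liner above packs the question and its context into a single string separated by `|||`. A minimal helper illustrating that input convention (the helper name is ours, not part of the nlu API):

```python
def to_nlu_qa_input(question, context):
    """Join a question and its context with the '|||' separator nlu expects."""
    return f"{question}|||{context}"

print(to_nlu_qa_input("What's my name?",
                      "My name is Clara and I live in Berkeley."))
# What's my name?|||My name is Clara and I live in Berkeley.
```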
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_6|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|415.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-6
---
layout: model
title: Translate English to Vietnamese Pipeline
author: John Snow Labs
name: translate_en_vi
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, vi, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `vi`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_vi_xx_2.7.0_2.4_1609698576382.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_vi_xx_2.7.0_2.4_1609698576382.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_vi", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_vi", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.vi').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_vi|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Arabic ElectraForQuestionAnswering model (from aymanm419) Version-2
author: John Snow Labs
name: electra_qa_araElectra_SQUAD_ARCD_768
date: 2022-06-22
tags: [ar, open_source, electra, question_answering]
task: Question Answering
language: ar
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `araElectra-SQUAD-ARCD-768` is an Arabic model originally trained by `aymanm419`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_araElectra_SQUAD_ARCD_768_ar_4.0.0_3.0_1655920218769.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_araElectra_SQUAD_ARCD_768_ar_4.0.0_3.0_1655920218769.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_araElectra_SQUAD_ARCD_768","ar") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_araElectra_SQUAD_ARCD_768","ar")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.answer_question.squad_arcd.electra.768d").predict("""ما هو اسمي؟|||اسمي كلارا وأنا أعيش في بيركلي.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_araElectra_SQUAD_ARCD_768|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ar|
|Size:|504.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/aymanm419/araElectra-SQUAD-ARCD-768
---
layout: model
title: Forward-Looking Statements Classification
author: John Snow Labs
name: finclf_bert_fls
date: 2022-09-06
tags: [en, finance, forward, looking, statements, fls, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a Text Classification model aimed at detecting, at sentence or paragraph level, whether there is a forward-looking statement (FLS).
FLS are beliefs and opinions about a firm's future events or results, usually present in documents such as financial reports. Identifying forward-looking statements in corporate reports can assist investors in financial analysis.
This model was originally trained on 3,500 manually annotated sentences from the Management Discussion and Analysis sections of annual reports of Russell 3000 firms, and then fine-tuned in-house by JSL on low-performing examples.
## Predicted Entities
`Specific FLS`, `Non-specific FLS`, `Not FLS`
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_bert_fls_en_1.0.0_3.2_1662468990598.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_bert_fls_en_1.0.0_3.2_1662468990598.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = nlp.Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
sequenceClassifier = finance.BertForSequenceClassification.pretrained("finclf_bert_fls", "en", "finance/models")\
.setInputCols(["document",'token'])\
.setOutputCol("class")
pipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
# couple of simple examples
example = spark.createDataFrame([["Global economy will increase during the next year."]]).toDF("text")
result = pipeline.fit(example).transform(example)
# result is a DataFrame
result.select("text", "class.result").show()
```
## Results
```bash
+--------------------+--------------+
| text| result|
+--------------------+--------------+
|Global economy wi...|[Specific FLS]|
+--------------------+--------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finclf_bert_fls|
|Type:|finance|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|412.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
In-house annotations on 10K financial reports and reports from Russell 3000 firms
## Benchmarking
```bash
label precision recall f1-score support
Specific_FLS 0.96 0.93 0.94 311
Non-specific_FLS 0.91 0.94 0.92 215
Not_FLS 0.84 0.87 0.85 70
accuracy - - 0.92 596
macro-avg 0.90 0.91 0.91 596
weighted-avg 0.93 0.92 0.92 596
```
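As a quick consistency check, the `weighted-avg` row above is the support-weighted mean of the per-class scores. A minimal sketch recomputing the weighted F1 from the table:

```python
# Per-class (F1, support) pairs copied from the benchmark table.
scores = {
    "Specific_FLS": (0.94, 311),
    "Non-specific_FLS": (0.92, 215),
    "Not_FLS": (0.85, 70),
}
total = sum(n for _, n in scores.values())                     # 596
weighted_f1 = sum(f1 * n for f1, n in scores.values()) / total
print(round(weighted_f1, 2))  # 0.92, matching the weighted-avg row
```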
---
layout: model
title: Pipeline to Detect PHI for Deidentification in Romanian (Word2Vec)
author: John Snow Labs
name: ner_deid_subentity_pipeline
date: 2023-03-09
tags: [ner, deidentification, word2vec, ro, licensed]
task: Named Entity Recognition
language: ro
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/06/27/ner_deid_w2v_subentity_ro_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pipeline_ro_4.3.0_3.2_1678386065654.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pipeline_ro_4.3.0_3.2_1678386065654.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_deid_subentity_pipeline", "ro", "clinical/models")
text = '''Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_deid_subentity_pipeline", "ro", "clinical/models")
val text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:-----------------------------|--------:|------:|:------------|-------------:|
| 0 | Spitalul Pentru Ochi de Deal | 0 | 27 | HOSPITAL | 0.5594 |
| 1 | Drumul Oprea Nr. 972 | 30 | 49 | STREET | 0.99724 |
| 2 | Vaslui | 51 | 56 | CITY | 1 |
| 3 | 737405 | 59 | 64 | ZIP | 1 |
| 4 | +40(235)413773 | 79 | 92 | PHONE | 1 |
| 5 | 25 May 2022 | 119 | 129 | DATE | 1 |
| 6 | BUREAN MARIA | 158 | 169 | PATIENT | 0.9515 |
| 7 | 77 | 180 | 181 | AGE | 1 |
| 8 | Agota Evelyn Tımar | 191 | 208 | DOCTOR | 0.8149 |
| 9 | 2450502264401 | 218 | 230 | IDNUM | 1 |
```
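In the table above, `begin` and `end` are inclusive character offsets into the input text, so a chunk is recovered with `text[begin:end + 1]`. A minimal sketch using rows 0 and 1:

```python
# First lines of the example text from the pipeline call above.
text = ("Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România\n"
        "Tel: +40(235)413773")

begin, end = 0, 27                 # HOSPITAL chunk, row 0 of the table
chunk = text[begin:end + 1]        # end is inclusive, so slice to end + 1
print(chunk)                       # Spitalul Pentru Ochi de Deal
```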
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_subentity_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|ro|
|Size:|1.2 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: English RobertaForQuestionAnswering (from squirro)
author: John Snow Labs
name: roberta_qa_distilroberta_base_squad_v2
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base-squad_v2` is an English model originally trained by `squirro`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_distilroberta_base_squad_v2_en_4.0.0_3.0_1655728374337.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_distilroberta_base_squad_v2_en_4.0.0_3.0_1655728374337.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_distilroberta_base_squad_v2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_distilroberta_base_squad_v2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.distilled_base_v2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_distilroberta_base_squad_v2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|307.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/squirro/distilroberta-base-squad_v2
- https://paperswithcode.com/sota?task=Question+Answering&dataset=The+Stanford+Question+Answering+Dataset
- https://www.linkedin.com/showcase/the-squirro-academy
- https://twitter.com/Squirro
- https://www.instagram.com/squirro/
- http://squirro.com
- https://www.linkedin.com/company/squirroag
- https://www.facebook.com/squirro
- https://rajpurkar.github.io/SQuAD-explorer/
---
layout: model
title: Portuguese Large Legal Bert Embeddings
author: John Snow Labs
name: bert_embeddings_bert_large_cased_pt_lenerbr
date: 2022-04-11
tags: [bert, embeddings, pt, open_source]
task: Embeddings
language: pt
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-cased-pt-lenerbr` is a Portuguese model originally trained by `pierreguillou`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_cased_pt_lenerbr_pt_3.4.2_3.0_1649673910638.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_cased_pt_lenerbr_pt_3.4.2_3.0_1649673910638.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_cased_pt_lenerbr","pt") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_cased_pt_lenerbr","pt")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Eu amo Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("pt.embed.bert_large_cased_pt_lenerbr").predict("""Eu amo Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_large_cased_pt_lenerbr|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|pt|
|Size:|1.3 GB|
|Case sensitive:|true|
## References
- https://huggingface.co/pierreguillou/bert-large-cased-pt-lenerbr
- https://medium.com/@pierre_guillou/nlp-modelos-e-web-app-para-reconhecimento-de-entidade-nomeada-ner-no-dom%C3%ADnio-jur%C3%ADdico-b658db55edfb
- https://github.com/piegu/language-models/blob/master/Finetuning_language_model_BERtimbau_LeNER_Br.ipynb
- https://paperswithcode.com/sota?task=Fill+Mask&dataset=pierreguillou%2Flener_br_finetuning_language_model
---
layout: model
title: Pipeline to Detect Cellular/Molecular Biology Entities
author: John Snow Labs
name: bert_token_classifier_ner_cellular_pipeline
date: 2022-03-09
tags: [cellular, ner, bert_for_token_classifier, en, licensed, clinical]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [bert_token_classifier_ner_cellular](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_cellular_en.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CELLULAR/){:.button.button-orange}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_CELLULAR.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_cellular_pipeline_en_3.4.1_3.0_1646826493144.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_cellular_pipeline_en_3.4.1_3.0_1646826493144.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
cellular_pipeline = PretrainedPipeline("bert_token_classifier_ner_cellular_pipeline", "en", "clinical/models")
cellular_pipeline.fullAnnotate("Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val cellular_pipeline = new PretrainedPipeline("bert_token_classifier_ner_cellular_pipeline", "en", "clinical/models")
cellular_pipeline.fullAnnotate("Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.cellular_pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""")
```
## Results
```bash
+-----------------------------------------------------------+---------+
|chunk |ner_label|
+-----------------------------------------------------------+---------+
|intracellular signaling proteins |protein |
|human T-cell leukemia virus type 1 promoter |DNA |
|Tax |protein |
|Tax-responsive element 1 |DNA |
|cyclic AMP-responsive members |protein |
|CREB/ATF family |protein |
|transcription factors |protein |
|Tax |protein |
|human T-cell leukemia virus type 1 Tax-responsive element 1|DNA |
|TRE-1 |DNA |
|lacZ gene |DNA |
|CYC1 promoter |DNA |
|TRE-1 |DNA |
|cyclic AMP response element-binding protein |protein |
|CREB |protein |
|CREB |protein |
|GAL4 activation domain |protein |
|GAD |protein |
|reporter gene |DNA |
|Tax |protein |
+-----------------------------------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_cellular_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.4 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverter
- Finisher
---
layout: model
title: Classifier for Genders - BIOBERT
author: John Snow Labs
name: classifierdl_gender_biobert
date: 2020-12-16
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 2.6.5
spark_version: 2.4
tags: [classifier, en, clinical, licensed]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model classifies the gender of the patient in the clinical document.
{:.h2_title}
## Predicted Entities
`Female`, `Male`, `Unknown`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLINICAL_CLASSIFICATION.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierdl_gender_biobert_en_2.6.4_2.4_1608119684447.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierdl_gender_biobert_en_2.6.4_2.4_1608119684447.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
To classify your text, you can use this model as part of an NLP pipeline with the following stages: DocumentAssembler, Tokenizer, BertEmbeddings (`biobert_pubmed_base_cased`), SentenceEmbeddings, ClassifierDLModel.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
biobert_embeddings = BertEmbeddings().pretrained('biobert_pubmed_base_cased') \
.setInputCols(["document","token"])\
.setOutputCol("bert_embeddings")
sentence_embeddings = SentenceEmbeddings() \
.setInputCols(["document", "bert_embeddings"]) \
.setOutputCol("sentence_bert_embeddings") \
.setPoolingStrategy("AVERAGE")
gender_classifier = ClassifierDLModel.pretrained('classifierdl_gender_biobert', 'en', 'clinical/models') \
.setInputCols(["document", "sentence_bert_embeddings"]) \
.setOutputCol("gender")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, biobert_embeddings, sentence_embeddings, gender_classifier])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate("""social history: shows that does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val biobert_embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
.setInputCols(Array("document","token"))
.setOutputCol("bert_embeddings")
val sentence_embeddings = new SentenceEmbeddings()
.setInputCols(Array("document", "bert_embeddings"))
.setOutputCol("sentence_bert_embeddings")
.setPoolingStrategy("AVERAGE")
val genderClassifier = ClassifierDLModel.pretrained("classifierdl_gender_biobert", "en", "clinical/models")
.setInputCols(Array("document", "sentence_bert_embeddings"))
.setOutputCol("gender")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, biobert_embeddings, sentence_embeddings, genderClassifier))
val data = Seq("""social history: shows that does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.gender.biobert").predict("""social history: shows that does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""")
```
{:.h2_title}
## Results
```bash
Female
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|classifierdl_gender_biobert|
|Type:|ClassifierDLModel|
|Compatibility:|Healthcare NLP 2.6.5+|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Case sensitive:|True|
{:.h2_title}
## Data Source
This model is trained on more than four thousand clinical documents (radiology reports, pathology reports, clinical visit notes, etc.), annotated internally.
{:.h2_title}
## Benchmarking
```bash
label precision recall f1-score support
Female 0.9224 0.8954 0.9087 239
Male 0.7895 0.8468 0.8171 124
Unknown 0.8077 0.7778 0.7925 54
accuracy 0.8657 417
macro-avg 0.8399 0.8400 0.8394 417
weighted-avg 0.8680 0.8657 0.8664 417
```
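As a quick sanity check, the macro and weighted averages in the table above follow directly from the per-class rows. A plain-Python sketch (independent of Spark NLP) recomputing them:

```python
# Per-class rows from the benchmark table above: (precision, recall, f1, support).
rows = {
    "Female":  (0.9224, 0.8954, 0.9087, 239),
    "Male":    (0.7895, 0.8468, 0.8171, 124),
    "Unknown": (0.8077, 0.7778, 0.7925, 54),
}
total = sum(r[3] for r in rows.values())  # 417

# Macro average: unweighted mean over the three classes.
macro = [round(sum(r[i] for r in rows.values()) / len(rows), 4) for i in range(3)]
# Weighted average: mean weighted by each class's support.
weighted = [round(sum(r[i] * r[3] for r in rows.values()) / total, 4) for i in range(3)]

print(macro)     # precision, recall, f1
print(weighted)
```

Both lists reproduce the `macro-avg` and `weighted-avg` rows reported above.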
---
layout: model
title: Fast Neural Machine Translation Model from English to Celtic Languages
author: John Snow Labs
name: opus_mt_en_cel
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, cel, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `en`
- target languages: `cel`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_cel_xx_2.7.0_2.4_1609163584650.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_cel_xx_2.7.0_2.4_1609163584650.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_cel", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_cel", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.cel').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_cel|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from mrm8488)
author: John Snow Labs
name: roberta_qa_base_1b_1_finetuned_squadv2
date: 2022-12-02
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-1B-1-finetuned-squadv2` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_1b_1_finetuned_squadv2_en_4.2.4_3.0_1669985564517.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_1b_1_finetuned_squadv2_en_4.2.4_3.0_1669985564517.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_1b_1_finetuned_squadv2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_1b_1_finetuned_squadv2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_1b_1_finetuned_squadv2|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|447.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mrm8488/roberta-base-1B-1-finetuned-squadv2
- https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/
- https://twitter.com/mrm8488
- https://www.linkedin.com/in/manuel-romero-cs/
---
layout: model
title: Translate Polish to English Pipeline
author: John Snow Labs
name: translate_pl_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, pl, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `pl`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_pl_en_xx_2.7.0_2.4_1609690625108.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_pl_en_xx_2.7.0_2.4_1609690625108.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_pl_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_pl_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.pl.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_pl_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal NER for NDA (Remedies Clauses)
author: John Snow Labs
name: legner_nda_remedies
date: 2023-04-16
tags: [en, licensed, ner, legal, nda, remedies]
task: Named Entity Recognition
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a NER model, aimed to be run **only** after detecting the `REMEDIES` clause with a proper classifier (use `legmulticlf_mnda_sections_paragraph_other` for that purpose). It will extract the following entities: `CURRENCY`, `NUMERIC_REMEDY`, and `REMEDY_TYPE`.
## Predicted Entities
`CURRENCY`, `NUMERIC_REMEDY`, `REMEDY_TYPE`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_remedies_en_1.0.0_3.0_1681687124993.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_remedies_en_1.0.0_3.0_1681687124993.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)\
.setCaseSensitive(True)
ner_model = legal.NerModel.pretrained("legner_nda_remedies", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""The breaching party shall pay the non-breaching party liquidated damages of $ 1,000 per day for each day that the breach of this NDA continues."""]
result = model.transform(spark.createDataFrame([text]).toDF("text"))
```
## Results
```bash
+------------------+--------------+
|chunk |ner_label |
+------------------+--------------+
|liquidated damages|REMEDY_TYPE |
|$ |CURRENCY |
|1,000 |NUMERIC_REMEDY|
+------------------+--------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_nda_remedies|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|16.3 MB|
## References
In-house annotations on non-disclosure agreements.
## Benchmarking
```bash
label precision recall f1-score support
CURRENCY 1.00 1.00 1.00 11
NUMERIC_REMEDY 1.00 1.00 1.00 11
REMEDY_TYPE 0.86 0.94 0.90 32
micro-avg 0.91 0.96 0.94 54
macro-avg 0.95 0.98 0.97 54
weighted-avg 0.92 0.96 0.94 54
```
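The `weighted-avg` row above is the support-weighted mean of the per-class scores; a small plain-Python check (independent of Spark NLP):

```python
# Per-class rows from the benchmark table above: (precision, recall, f1, support).
rows = [
    (1.00, 1.00, 1.00, 11),  # CURRENCY
    (1.00, 1.00, 1.00, 11),  # NUMERIC_REMEDY
    (0.86, 0.94, 0.90, 32),  # REMEDY_TYPE
]
total = sum(r[3] for r in rows)  # 54

# Support-weighted averages for precision, recall, and f1.
weighted = [round(sum(r[i] * r[3] for r in rows) / total, 2) for i in range(3)]
print(weighted)
```

The result matches the reported `weighted-avg` row of 0.92 / 0.96 / 0.94.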
---
layout: model
title: Fast Neural Machine Translation Model from Atlantic-Congo Languages to English
author: John Snow Labs
name: opus_mt_alv_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, alv, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `alv`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_alv_en_xx_2.7.0_2.4_1609169974046.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_alv_en_xx_2.7.0_2.4_1609169974046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_alv_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_alv_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.alv.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_alv_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Pipeline to Detect Social Determinants of Health Mentions
author: John Snow Labs
name: ner_sdoh_mentions_pipeline
date: 2023-03-08
tags: [en, licensed, ner, sdoh, mentions, clinical]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_sdoh_mentions](https://nlp.johnsnowlabs.com/2022/12/18/ner_sdoh_mentions_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_mentions_pipeline_en_4.3.0_3.2_1678281267173.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_mentions_pipeline_en_4.3.0_3.2_1678281267173.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_sdoh_mentions_pipeline", "en", "clinical/models")
text = '''Mr. Known lastname 9880 is a pleasant, cooperative gentleman with a long standing history (20 years) diverticulitis. He is married and has 3 children. He works in a bank. He denies any alcohol or intravenous drug use. He has been smoking for many years.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_sdoh_mentions_pipeline", "en", "clinical/models")
val text = "Mr. Known lastname 9880 is a pleasant, cooperative gentleman with a long standing history (20 years) diverticulitis. He is married and has 3 children. He works in a bank. He denies any alcohol or intravenous drug use. He has been smoking for many years."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | chunks | begin | end | entities | confidence |
|---:|:-----------------|--------:|------:|:-----------------|-------------:|
| 0 | married | 123 | 129 | sdoh_community | 0.9972 |
| 1 | children | 141 | 148 | sdoh_community | 0.9999 |
| 2 | works | 154 | 158 | sdoh_economics | 0.9995 |
| 3 | alcohol | 185 | 191 | behavior_alcohol | 0.9925 |
| 4 | intravenous drug | 196 | 211 | behavior_drug | 0.9803 |
| 5 | smoking | 230 | 236 | behavior_tobacco | 0.9997 |
```
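The `begin` and `end` columns in the results table above are inclusive character offsets into the input text, which can be verified with plain Python (no Spark session needed):

```python
# The sample text from the pipeline above; begin/end in the results table are
# inclusive character offsets into this string, so slice with end + 1.
text = (
    "Mr. Known lastname 9880 is a pleasant, cooperative gentleman with a "
    "long standing history (20 years) diverticulitis. He is married and "
    "has 3 children. He works in a bank. He denies any alcohol or "
    "intravenous drug use. He has been smoking for many years."
)

assert text[123:130] == "married"   # begin=123, end=129
assert text[141:149] == "children"  # begin=141, end=148
assert text[230:237] == "smoking"   # begin=230, end=236
```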
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_sdoh_mentions_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Yoruba Named Entity Recognition (from mbeukman)
author: John Snow Labs
name: xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_yoruba
date: 2022-05-17
tags: [xlm_roberta, ner, token_classification, yo, open_source]
task: Named Entity Recognition
language: yo
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-swahili-finetuned-ner-yoruba` is a Yoruba model originally trained by `mbeukman`.
## Predicted Entities
`PER`, `ORG`, `LOC`, `DATE`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_yoruba_yo_3.4.2_3.0_1652808736465.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_yoruba_yo_3.4.2_3.0_1652808736465.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_yoruba","yo") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Mo nifẹ Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_yoruba","yo")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Mo nifẹ Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_yoruba|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|yo|
|Size:|1.0 GB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-swahili-finetuned-ner-yoruba
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://www.apache.org/licenses/LICENSE-2.0
- https://github.com/Michael-Beukma
---
layout: model
title: NER on Capital Calls (Small)
author: John Snow Labs
name: finner_capital_calls
date: 2023-02-01
tags: [capital, calls, en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: FinanceNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a `small` capital call NER model, trained to extract contact and financial information from Capital Call Notices. These are the entities retrieved by the model:
```
Financial information:
FUND: Name of the Fund called
ORG: Organization asking the Fund for the Capital
AMOUNT: Amount called by ORG to FUND
DUE_DATE: Due date of the call
ACCOUNT_NAME: Organization's Bank Account Name
ACCOUNT_NUMBER: Organization's Bank Account Number
ABA: Routing Number (ABA)
BANK_ADDRESS: Contact address of the Bank
Contact information:
PHONE: Contact Phone
PERSON: Contact Person
BANK_CONTACT: Person to contact in Bank
EMAIL: Contact Email
Other additional information, not directly involved in the call:
OTHER_PERSON: Other people detected (people signing the call, people to whom the Notice is addressed, etc.)
OTHER_PERCENTAGE: Percentages mentioned
OTHER_DATE: Other dates mentioned, not the Due Date
OTHER_AMOUNT: Other amounts mentioned
OTHER_ADDRESS: Other addresses mentioned
OTHER_ORG: Other ORGs mentioned
```
## Predicted Entities
`FUND`, `ORG`, `AMOUNT`, `DUE_DATE`, `ACCOUNT_NAME`, `ACCOUNT_NUMBER`, `BANK_ADDRESS`, `PHONE`, `PERSON`, `BANK_CONTACT`, `EMAIL`, `OTHER_PERSON`, `OTHER_PERCENTAGE`, `OTHER_DATE`, `OTHER_AMOUNT`, `OTHER_ADDRESS`, `OTHER_ORG`, `ABA`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_CAPITAL_CALLS){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_capital_calls_en_1.0.0_3.0_1675250939298.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_capital_calls_en_1.0.0_3.0_1675250939298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from pyspark.sql import functions as F
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)
ner = finance.NerModel.pretrained('finner_capital_calls', 'en', 'finance/models')\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
converter = finance.NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline(stages=[documentAssembler,
sentence,
tokenizer,
embeddings,
ner,
converter
])
df = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(df)
lp = nlp.LightPipeline(model)
text = """Dear Charlotte R. Davis,
We hope this message finds you well. This is to inform you that a capital call for Upfront Ventures has been initiated. The amount requested is 7000 EUR and is due on 01.01.2024.
Kindly wire transfer the funds to the following account:
Account Green Planet Solutions LLC
Account Number 1234567-1XX
Routing Number 51903761
Bank First Republic Bank
If you require any further information, please do not hesitate to reach out to us at 3055 550818 or coxeric@example.com.
Thank you for your prompt attention to this matter.
Best regards,
James Wilson"""
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols")) \
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label"),
F.expr("cols['1']['confidence']").alias("confidence")).show(truncate=False)
```
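The final `select` above zips each `ner_chunk` result with its metadata before exploding one row per chunk. The same pairing can be illustrated in plain Python (a toy sketch: the chunks come from the sample notice, but the labels and confidences here are made-up placeholders, not actual model output):

```python
# Toy illustration of F.arrays_zip + F.explode on the NER output:
# pair the parallel `result` and `metadata` arrays, then emit one row per pair.
chunks = ["Upfront Ventures", "7000 EUR", "01.01.2024"]
metadata = [
    {"entity": "FUND", "confidence": "0.99"},      # placeholder confidence
    {"entity": "AMOUNT", "confidence": "0.98"},    # placeholder confidence
    {"entity": "DUE_DATE", "confidence": "0.97"},  # placeholder confidence
]

rows = [
    {"chunk": c, "ner_label": m["entity"], "confidence": m["confidence"]}
    for c, m in zip(chunks, metadata)  # arrays_zip + explode, in effect
]

for row in rows:
    print(row["chunk"], row["ner_label"], row["confidence"])
```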
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_recruit","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_recruit","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.longformer.by_manishiitg").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|longformer_qa_recruit|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|556.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/manishiitg/longformer-recruit-qa
---
layout: model
title: English BertForTokenClassification Cased model (from nguyenkhoa2407)
author: John Snow Labs
name: bert_token_classifier_autotrain_ner_favsbot
date: 2022-11-30
tags: [en, open_source, bert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-bert-NER-favsbot` is an English model originally trained by `nguyenkhoa2407`.
## Predicted Entities
`TIME`, `SORT`, `PER`, `LOC`, `TAG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_ner_favsbot_en_4.2.4_3.0_1669814444554.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_ner_favsbot_en_4.2.4_3.0_1669814444554.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_ner_favsbot","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_ner_favsbot","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_autotrain_ner_favsbot|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/nguyenkhoa2407/autotrain-bert-NER-favsbot
---
layout: model
title: Adverse Drug Events Classifier (DistilBERT)
author: John Snow Labs
name: distilbert_sequence_classifier_ade
date: 2022-02-08
tags: [bert, sequence_classification, en, licensed]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: MedicalDistilBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Classifies text/sentences into two categories:
`True`: The sentence talks about a possible adverse drug event (ADE).
`False`: The sentence does not contain any information about an ADE.
This model is a [DistilBERT](https://huggingface.co/distilbert-base-cased)-based classifier. Please note that there is no biomedical version of DistilBERT, so the performance may not be on par with BioBERT-based classifiers.
## Predicted Entities
`True`, `False`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_ADE/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/08.3.MedicalBertForSequenceClassification_in_SparkNLP.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/distilbert_sequence_classifier_ade_en_3.4.1_3.0_1644352732829.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/distilbert_sequence_classifier_ade_en_3.4.1_3.0_1644352732829.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
sequenceClassifier = MedicalDistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
data = spark.createDataFrame([["I felt a bit drowsy and had blurred vision after taking Aspirin."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val sequenceClassifier = MedicalDistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))
val data = Seq("I felt a bit drowsy and had blurred vision after taking Aspirin.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.ade.seq_distilbert").predict("""I felt a bit drowsy and had blurred vision after taking Aspirin.""")
```
## Results
```bash
+----------------------------------------------------------------+------+
|text |result|
+----------------------------------------------------------------+------+
|I felt a bit drowsy and had blurred vision after taking Aspirin.|[True]|
+----------------------------------------------------------------+------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_sequence_classifier_ade|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.3 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
This model is trained on a custom dataset comprising the CADEC, DRUG-AE, and TwiMed corpora.
## Benchmarking
```bash
label precision recall f1-score support
False 0.93 0.93 0.93 6935
True 0.64 0.65 0.65 1347
accuracy 0.88 0.88 0.88 8282
macro-avg 0.79 0.79 0.79 8282
weighted-avg 0.89 0.88 0.89 8282
```
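As a quick sanity check, the macro-averaged row can be reproduced from the per-class rows above. A minimal sketch using the rounded values printed in the table, so small rounding drift against the reported figures is expected:

```python
# Per-class metrics copied from the benchmarking table above (rounded values).
precision = {"False": 0.93, "True": 0.64}
recall    = {"False": 0.93, "True": 0.65}
f1        = {"False": 0.93, "True": 0.65}

# Macro average = unweighted mean over the two classes.
macro_p = sum(precision.values()) / 2  # 0.785, reported as 0.79 in the table
macro_r = sum(recall.values()) / 2     # 0.79
macro_f = sum(f1.values()) / 2         # 0.79

print(macro_p, macro_r, macro_f)
```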
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from anablasi)
author: John Snow Labs
name: roberta_qa_model_10k
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `model_10k_qa` is an English model originally trained by `anablasi`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_model_10k_en_4.3.0_3.0_1674211482662.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_model_10k_en_4.3.0_3.0_1674211482662.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_model_10k","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_model_10k","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
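For intuition about what the annotator returns in the `answer` column: extractive QA heads like this one score every token as a candidate answer start and end, and the best valid span (start before end) wins. A toy pure-Python sketch with made-up tokens and logits, not the model's actual scores:

```python
# Hypothetical tokenized context and per-token start/end logits for illustration.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 3.5, 0.0, 0.1, 0.2, 0.1, 0.3, 0.0]
end_logits   = [0.0, 0.1, 0.2, 3.1, 0.1, 0.0, 0.1, 0.2, 0.4, 0.1]

# Pick the (start, end) pair with the highest combined score, start <= end.
best = max(
    ((s, e) for s in range(len(tokens)) for e in range(s, len(tokens))),
    key=lambda p: start_logits[p[0]] + end_logits[p[1]],
)
answer = " ".join(tokens[best[0]: best[1] + 1])
print(answer)  # → Clara
```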
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_model_10k|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|467.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anablasi/model_10k_qa
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_6
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-128-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_6_en_4.0.0_3.0_1657184326520.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_6_en_4.0.0_3.0_1657184326520.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_6","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_6","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_6|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-128-finetuned-squad-seed-6
---
layout: model
title: English DistilBertForQuestionAnswering model (from threem) Squad1
author: John Snow Labs
name: distilbert_qa_mysquadv2_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mysquadv2-finetuned-squad` is an English model originally trained by `threem`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_mysquadv2_finetuned_squad_en_4.0.0_3.0_1654728438175.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_mysquadv2_finetuned_squad_en_4.0.0_3.0_1654728438175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mysquadv2_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mysquadv2_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.distil_bert.by_threem").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_mysquadv2_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/threem/mysquadv2-finetuned-squad
---
layout: model
title: Detect Oncology-Specific Entities
author: John Snow Labs
name: ner_oncology_limited_80p_for_benchmarks
date: 2023-04-03
tags: [licensed, clinical, en, oncology, biomarker, treatment]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
`Important Note:` This model is trained on a partial version of the dataset used to train [ner_oncology](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_en.html), and is meant to be used for the benchmarking runs at [LLMs Healthcare Benchmarks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/academic/LLMs_in_Healthcare).
This model extracts more than 40 oncology-related entities, including therapies, tests and staging.
Definitions of Predicted Entities:
`Adenopathy`: Mentions of pathological findings of the lymph nodes.
`Age`: All mentions of ages, past or present, related to the patient or to anybody else.
`Biomarker`: Biological molecules that indicate the presence or absence of cancer, or the type of cancer. Oncogenes are excluded from this category.
`Biomarker_Result`: Terms or values that are identified as the result of a biomarker.
`Cancer_Dx`: Mentions of cancer diagnoses (such as “breast cancer”) or pathological types that are usually used as synonyms for “cancer” (e.g. “carcinoma”). When anatomical references are present, they are included in the Cancer_Dx extraction.
`Cancer_Score`: Clinical or imaging scores that are specific for cancer settings (e.g. “BI-RADS” or “Allred score”).
`Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment.
`Chemotherapy`: Mentions of chemotherapy drugs, or unspecific words such as “chemotherapy”.
`Cycle_Count`: The total number of cycles being administered of an oncological therapy (e.g. “5 cycles”).
`Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. “day 5”).
`Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. “third cycle”).
`Date`: Mentions of exact dates, in any format, including day number, month and/or year.
`Death_Entity`: Words that indicate the death of the patient or someone else (including family members), such as “died” or “passed away”.
`Direction`: Directional and laterality terms, such as “left”, “right”, “bilateral”, “upper” and “lower”.
`Dosage`: The quantity prescribed by the physician for an active ingredient.
`Duration`: Words indicating the duration of a treatment (e.g. “for 2 weeks”).
`Frequency`: Words indicating the frequency of treatment administration (e.g. “daily” or “bid”).
`Gender`: Gender-specific nouns and pronouns (including words such as “him” or “she”, and family members such as “father”).
`Grade`: All pathological grading of tumors (e.g. “grade 1”) or degrees of cellular differentiation (e.g. “well-differentiated”).
`Histological_Type`: Histological variants or cancer subtypes, such as “papillary”, “clear cell” or “medullary”.
`Hormonal_Therapy`: Mentions of hormonal drugs used to treat cancer, or unspecific words such as “hormonal therapy”.
`Imaging_Test`: Imaging tests mentioned in texts, such as “chest CT scan”.
`Immunotherapy`: Mentions of immunotherapy drugs, or unspecific words such as “immunotherapy”.
`Invasion`: Mentions that refer to tumor invasion, such as “invasion” or “involvement”. Metastases or lymph node involvement are excluded from this category.
`Line_Of_Therapy`: Explicit references to the line of therapy of an oncological therapy (e.g. “first-line treatment”).
`Metastasis`: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions.
`Oncogene`: Mentions of genes that are implicated in the etiology of cancer.
`Pathology_Result`: The findings of a biopsy from the pathology report that is not covered by another entity (e.g. “malignant ductal cells”).
`Pathology_Test`: Mentions of biopsies or tests that use tissue samples.
`Performance_Status`: Mentions of performance status scores, such as ECOG and Karnofsky. The name of the score is extracted together with the result (e.g. “ECOG performance status of 4”).
`Race_Ethnicity`: The race and ethnicity categories include racial and national origin or sociocultural groups.
`Radiotherapy`: Terms that indicate the use of Radiotherapy.
`Response_To_Treatment`: Terms describing the patient’s clinical progress under cancer treatment, including “recurrence”, “bad response” or “improvement”.
`Relative_Date`: Temporal references that are relative to the date of the text or to any other specific date (e.g. “yesterday” or “three years later”).
`Route`: Words indicating the type of administration route (such as “PO” or “transdermal”).
`Site_Bone`: Anatomical terms that refer to the human skeleton.
`Site_Brain`: Anatomical terms that refer to the central nervous system (including the brain stem and the cerebellum).
`Site_Breast`: Anatomical terms that refer to the breasts.
`Site_Liver`: Anatomical terms that refer to the liver.
`Site_Lung`: Anatomical terms that refer to the lungs.
`Site_Lymph_Node`: Anatomical terms that refer to lymph nodes, excluding adenopathies.
`Site_Other_Body_Part`: Relevant anatomical terms that are not included in the rest of the anatomical entities.
`Smoking_Status`: All mentions of smoking related to the patient or to someone else.
`Staging`: Mentions of cancer stage such as “stage 2b” or “T2N1M0”. It also includes words such as “in situ”, “early-stage” or “advanced”.
`Targeted_Therapy`: Mentions of targeted therapy drugs, or unspecific words such as “targeted therapy”.
`Tumor_Finding`: All nonspecific terms that may be related to tumors, either malignant or benign (for example: “mass”, “tumor”, “lesion”, or “neoplasm”).
`Tumor_Size`: Size of the tumor, including numerical value and unit of measurement (e.g. “3 cm”).
`Unspecific_Therapy`: Terms that indicate a known cancer therapy but are not specific to any other therapy entity (e.g. “chemoradiotherapy” or “adjuvant therapy”).
## Predicted Entities
`Histological_Type`, `Direction`, `Staging`, `Cancer_Score`, `Imaging_Test`, `Cycle_Number`, `Tumor_Finding`, `Site_Lymph_Node`, `Invasion`, `Response_To_Treatment`, `Smoking_Status`, `Tumor_Size`, `Cycle_Count`, `Adenopathy`, `Age`, `Biomarker_Result`, `Unspecific_Therapy`, `Site_Breast`, `Chemotherapy`, `Targeted_Therapy`, `Radiotherapy`, `Performance_Status`, `Pathology_Test`, `Site_Other_Body_Part`, `Cancer_Surgery`, `Line_Of_Therapy`, `Pathology_Result`, `Hormonal_Therapy`, `Site_Bone`, `Biomarker`, `Immunotherapy`, `Cycle_Day`, `Frequency`, `Route`, `Duration`, `Death_Entity`, `Metastasis`, `Site_Liver`, `Cancer_Dx`, `Grade`, `Date`, `Site_Lung`, `Site_Brain`, `Relative_Date`, `Race_Ethnicity`, `Gender`, `Oncogene`, `Dosage`, `Radiation_Dose`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_limited_80p_for_benchmarks_en_4.3.2_3.0_1680548141397.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_limited_80p_for_benchmarks_en_4.3.2_3.0_1680548141397.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_limited_80p_for_benchmarks", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["""The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago.
The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast.
The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_limited_80p_for_benchmarks", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago.
The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast.
The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
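The `NerConverterInternal` stage merges the model's token-level BIO tags into entity chunks. A minimal pure-Python sketch of that merging logic, using illustrative tokens and tags (not actual model output) drawn from the example text above:

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (chunk_text, label) pairs.

    B-X starts a new chunk with label X; a following I-X with the same
    label extends it; O or a label change closes the open chunk.
    """
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Illustrative tags for a fragment of the example sentence.
print(bio_to_chunks(
    ["left", "breast", "cancer", "twenty", "years", "ago"],
    ["B-Direction", "B-Cancer_Dx", "I-Cancer_Dx",
     "B-Relative_Date", "I-Relative_Date", "I-Relative_Date"],
))
# → [('left', 'Direction'), ('breast cancer', 'Cancer_Dx'), ('twenty years ago', 'Relative_Date')]
```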
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_960h_4_gram", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_960h_4_gram", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_960h_4_gram|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|227.6 MB|
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_2
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-16-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_2_en_4.0.0_3.0_1657184494597.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_2_en_4.0.0_3.0_1657184494597.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-16-finetuned-squad-seed-2
---
layout: model
title: Icelandic RobertaForQuestionAnswering Cased model (from vesteinn)
author: John Snow Labs
name: roberta_qa_icebert
date: 2022-12-02
tags: [is, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: is
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `IceBERT-QA` is an Icelandic model originally trained by `vesteinn`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_icebert_is_4.2.4_3.0_1669972802947.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_icebert_is_4.2.4_3.0_1669972802947.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_icebert","is") \
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_icebert","is")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_icebert|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|is|
|Size:|463.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/vesteinn/IceBERT-QA
---
layout: model
title: Hindi DistilBERT Embeddings (from Geotrend)
author: John Snow Labs
name: distilbert_embeddings_distilbert_base_hi_cased
date: 2022-04-12
tags: [distilbert, embeddings, hi, open_source]
task: Embeddings
language: hi
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-hi-cased` is a Hindi model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_hi_cased_hi_3.4.2_3.0_1649783460616.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_hi_cased_hi_3.4.2_3.0_1649783460616.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_hi_cased","hi") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["मुझे स्पार्क एनएलपी पसंद है"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_hi_cased","hi")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("मुझे स्पार्क एनएलपी पसंद है").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("hi.embed.distilbert_base_hi_cased").predict("""मुझे स्पार्क एनएलपी पसंद है""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_embeddings_distilbert_base_hi_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|hi|
|Size:|177.5 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/distilbert-base-hi-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Spanish RobertaForTokenClassification Cased model (from mrm8488)
author: John Snow Labs
name: roberta_ner_finetuned_bioclinical
date: 2022-07-18
tags: [open_source, roberta, ner, bioclinical, es]
task: Named Entity Recognition
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa NER model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bioclinical-roberta-es-finenuned-clinical-ner` is a Spanish model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_ner_finetuned_bioclinical_es_4.0.0_3.0_1658155068450.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_ner_finetuned_bioclinical_es_4.0.0_3.0_1658155068450.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
ner = RoBertaForTokenClassification.pretrained("roberta_ner_finetuned_bioclinical","es") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, ner])
data = spark.createDataFrame([["PUT YOUR STRING HERE."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val ner = RoBertaForTokenClassification.pretrained("roberta_ner_finetuned_bioclinical","es")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, ner))
val data = Seq("PUT YOUR STRING HERE.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_ner_finetuned_bioclinical|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|441.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
https://huggingface.co/mrm8488/bioclinical-roberta-es-finenuned-clinical-ner
---
layout: model
title: Fast Neural Machine Translation Model from Armenian to English
author: John Snow Labs
name: opus_mt_hy_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, hy, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `hy`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_hy_en_xx_2.7.0_2.4_1609169598231.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_hy_en_xx_2.7.0_2.4_1609169598231.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_hy_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_hy_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.hy.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_hy_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Persian ALBERT Embeddings (from m3hrdadfi)
author: John Snow Labs
name: albert_embeddings_albert_fa_base_v2
date: 2022-04-14
tags: [albert, embeddings, fa, open_source]
task: Embeddings
language: fa
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: AlBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-fa-base-v2` is a Persian model originally trained by `m3hrdadfi`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_fa_base_v2_fa_3.4.2_3.0_1649954318874.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_fa_base_v2_fa_3.4.2_3.0_1649954318874.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_fa_base_v2","fa") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["من عاشق جرقه NLP هستم"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_fa_base_v2","fa")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("من عاشق جرقه NLP هستم").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("fa.embed.albert").predict("""من عاشق جرقه NLP هستم""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_embeddings_albert_fa_base_v2|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|fa|
|Size:|69.2 MB|
|Case sensitive:|false|
## References
- https://huggingface.co/m3hrdadfi/albert-fa-base-v2
- https://dumps.wikimedia.org/fawiki/
- https://github.com/miras-tech/MirasText
- https://bigbangpage.com/
- https://www.chetor.com/
- https://www.eligasht.com/Blog/
- https://www.digikala.com/mag/
- https://www.ted.com/talks
- https://github.com/m3hrdadfi/albert-persian
- https://github.com/hooshvare/parsbert
---
layout: model
title: Lemmatizer (Portuguese, SpacyLookup)
author: John Snow Labs
name: lemma_spacylookup
date: 2022-03-08
tags: [open_source, lemmatizer, pt]
task: Lemmatization
language: pt
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Portuguese lemmatizer is a scalable, production-ready version of the rule-based lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_pt_3.4.1_3.0_1646753629532.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_pt_3.4.1_3.0_1646753629532.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","pt") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer])
example = spark.createDataFrame([["Você não é melhor que eu"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","pt")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer))
val data = Seq("Você não é melhor que eu").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("pt.lemma.spacylookup").predict("""Você não é melhor que eu""")
```
## Results
```bash
+---------------------------------+
|result |
+---------------------------------+
|[Você, não, ser, melhor, que, eu]|
+---------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma_spacylookup|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|pt|
|Size:|8.9 MB|
---
layout: model
title: Detect Assertion Status (assertion_dl_en)
author: John Snow Labs
name: assertion_dl_en
date: 2020-01-30
task: Assertion Status
language: en
nav_key: models
edition: Healthcare NLP 2.4.0
spark_version: 2.4
tags: [clinical, licensed, ner, en]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Deep learning named entity recognition model for assertions. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN.
{:.h2_title}
## Predicted Entities
``hypothetical``, ``present``, ``absent``, ``possible``, ``conditional``, ``associated_with_someone_else``.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_en_2.4.0_2.4_1580237286004.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_en_2.4.0_2.4_1580237286004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter, AssertionDLModel.
{% include programmingLanguageSelectScalaPython.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained()\
.setInputCols("document")\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
clinical_assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
nlpPipeline = Pipeline(stages=[document_assembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
clinical_assertion])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_result = LightPipeline(model).fullAnnotate('Patient has a headache for the last 2 weeks and appears anxious when she walks fast. No alopecia noted. She denies pain')[0]
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val clinical_assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings"))
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
clinical_assertion))
val data = Seq("Patient has a headache for the last 2 weeks and appears anxious when she walks fast. No alopecia noted. She denies pain").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
The output is a dataframe with a sentence per row and an ``"assertion"`` column containing all of the assertion labels in the sentence. The assertion column also contains assertion character indices, and other metadata. To get only the entity chunks and assertion labels, without the metadata, select ``"ner_chunk.result"`` and ``"assertion.result"`` from your output dataframe.
```bash
| | chunks | entities | assertion |
|---|------------|----------|-------------|
| 0 | a headache | PROBLEM | present |
| 1 | anxious | PROBLEM | conditional |
| 2 | alopecia | PROBLEM | absent |
| 3 | pain | PROBLEM | absent |
```
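Once `ner_chunk.result` and `assertion.result` are collected from the output dataframe, pairing each entity chunk with its assertion label is a simple `zip`. A minimal pure-Python sketch, where the lists stand in for the collected column values from the example above:

```python
# Stand-ins for the collected values of "ner_chunk.result" and "assertion.result"
chunks = ["a headache", "anxious", "alopecia", "pain"]
assertions = ["present", "conditional", "absent", "absent"]

# Pair each entity chunk with its assertion label
paired = list(zip(chunks, assertions))
for chunk, label in paired:
    print(f"{chunk:12} -> {label}")
```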
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|assertion_dl|
|Type:|ner|
|Compatibility:|Spark NLP 2.4.0|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence, ner_chunk, embeddings]|
|Output Labels:|[assertion]|
|Language:|en|
|Case sensitive:|false|
## Data Source
Trained on 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text with 'embeddings_clinical'.
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
## Benchmarking
```bash
label prec rec f1
absent 0.94 0.87 0.91
associated_with_someone_else 0.81 0.73 0.76
conditional 0.78 0.24 0.37
hypothetical 0.89 0.75 0.81
possible 0.70 0.52 0.60
present 0.91 0.97 0.94
Macro-average 0.84 0.68 0.73
Micro-average 0.91 0.91 0.91
```
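The macro-average row above is the unweighted mean of the per-label scores (micro-average, by contrast, weights each label by its support). A quick sanity check of the F1 macro-average in Python:

```python
# Per-label F1 scores from the benchmarking table
f1 = {
    "absent": 0.91,
    "associated_with_someone_else": 0.76,
    "conditional": 0.37,
    "hypothetical": 0.81,
    "possible": 0.60,
    "present": 0.94,
}

# Macro-average: unweighted mean over labels
macro_f1 = sum(f1.values()) / len(f1)
print(round(macro_f1, 2))  # 0.73, matching the table
```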
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from ms12345)
author: John Snow Labs
name: roberta_qa_ms12345_base_squad2_finetuned_squad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-squad` is an English model originally trained by `ms12345`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ms12345_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219374435.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ms12345_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219374435.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ms12345_base_squad2_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ms12345_base_squad2_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_ms12345_base_squad2_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|464.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/ms12345/roberta-base-squad2-finetuned-squad
---
layout: model
title: Fast Neural Machine Translation Model from American Sign Language to French
author: John Snow Labs
name: opus_mt_ase_fr
date: 2021-06-01
tags: [open_source, seq2seq, translation, ase, fr, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `ase`
- target languages: `fr`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ase_fr_xx_3.1.0_2.4_1622561105602.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ase_fr_xx_3.1.0_2.4_1622561105602.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_ase_fr", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_ase_fr", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.American Sign Language.translate_to.French').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_ase_fr|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering model (from mrm8488) Xqua
author: John Snow Labs
name: distilbert_qa_multi_finetuned_for_xqua_on_tydiqa
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-multi-finetuned-for-xqua-on-tydiqa` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_multi_finetuned_for_xqua_on_tydiqa_en_4.0.0_3.0_1654727619668.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_multi_finetuned_for_xqua_on_tydiqa_en_4.0.0_3.0_1654727619668.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_multi_finetuned_for_xqua_on_tydiqa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_multi_finetuned_for_xqua_on_tydiqa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.tydiqa.distil_bert").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_multi_finetuned_for_xqua_on_tydiqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|505.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/mrm8488/distilbert-multi-finetuned-for-xqua-on-tydiqa
- https://ai.google.com/research/tydiqa
- https://github.com/google-research-datasets/tydiqa/blob/master/README.md#the-tasks
- https://twitter.com/mrm8488
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_10
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-32-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_10_en_4.0.0_3.0_1657184919799.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_10_en_4.0.0_3.0_1657184919799.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_10","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_10","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_10|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-32-finetuned-squad-seed-10
---
layout: model
title: English asr_wav2vec2_base_100h_test TFWav2Vec2ForCTC from saahith
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_100h_test
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_test` is an English model originally trained by saahith.
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_base_100h_test_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_test_en_4.2.0_3.0_1664094940990.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_test_en_4.2.0_3.0_1664094940990.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_100h_test', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_100h_test", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_100h_test|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|227.9 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English T5ForConditionalGeneration Cased model (from osanseviero)
author: John Snow Labs
name: t5_finetuned_test
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-finetuned-test` is an English model originally trained by `osanseviero`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_finetuned_test_en_4.3.0_3.0_1675124670488.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_finetuned_test_en_4.3.0_3.0_1675124670488.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_finetuned_test","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_finetuned_test","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_finetuned_test|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|286.9 MB|
## References
- https://huggingface.co/osanseviero/t5-finetuned-test
- https://medium.com/@priya.dwivedi/fine-tuning-a-t5-transformer-for-any-summarization-task-82334c64c81
---
layout: model
title: News Classifier Pipeline for German text
author: John Snow Labs
name: classifierdl_bert_news_pipeline
date: 2021-08-13
tags: [de, classifier, pipeline, news, open_source]
task: Pipeline Public
language: de
edition: Spark NLP 3.1.3
spark_version: 2.4
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pre-trained pipeline classifies German news texts.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_DE_NEWS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_DE_NEWS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_bert_news_pipeline_de_3.1.3_2.4_1628851787696.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_bert_news_pipeline_de_3.1.3_2.4_1628851787696.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("classifierdl_bert_news_pipeline", lang = "de")
result = pipeline.fullAnnotate("""Niki Lauda in einem McLaren MP 4/2 TAG Turbo. Mit diesem Gefährt sicherte sich der Österreicher 1984 seinen dritten Weltmeistertitel, einen halben (!)""")
```
```scala
val pipeline = new PretrainedPipeline("classifierdl_bert_news_pipeline", "de")
val result = pipeline.fullAnnotate("Niki Lauda in einem McLaren MP 4/2 TAG Turbo. Mit diesem Gefährt sicherte sich der Österreicher 1984 seinen dritten Weltmeistertitel, einen halben (!)")(0)
```
## Results
```bash
["Sport"]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|classifierdl_bert_news_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.1.3+|
|License:|Open Source|
|Edition:|Official|
|Language:|de|
## Included Models
- DocumentAssembler
- BertSentenceEmbeddings
- ClassifierDLModel
---
layout: model
title: English DistilBertForTokenClassification Base Cased model (from 51la5)
author: John Snow Labs
name: distilbert_token_classifier_base_ner
date: 2023-03-06
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-NER` is an English model originally trained by `51la5`.
## Predicted Entities
`LOC`, `ORG`, `PER`, `MISC`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_ner_en_4.3.1_3.0_1678133783319.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_ner_en_4.3.1_3.0_1678133783319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_ner","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_ner","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_base_ner|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/51la5/distilbert-base-NER
- https://paperswithcode.com/sota?task=Token+Classification&dataset=conll2003
---
layout: model
title: English ALBERT Embeddings (x-large)
author: John Snow Labs
name: albert_embeddings_albert_xlarge_v1
date: 2022-04-14
tags: [albert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: AlBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-xlarge-v1` is an English model originally trained by HuggingFace.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_xlarge_v1_en_3.4.2_3.0_1649954213986.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_xlarge_v1_en_3.4.2_3.0_1649954213986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_xlarge_v1","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_xlarge_v1","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.albert_xlarge_v1").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_embeddings_albert_xlarge_v1|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|221.6 MB|
|Case sensitive:|false|
## References
- https://huggingface.co/albert-xlarge-v1
- https://arxiv.org/abs/1909.11942
- https://github.com/google-research/albert
- https://yknzhu.wixsite.com/mbweb
- https://en.wikipedia.org/wiki/English_Wikipedia
---
layout: model
title: Recognize Entities DL Pipeline for Russian - Medium
author: John Snow Labs
name: entity_recognizer_md
date: 2021-03-22
tags: [open_source, russian, entity_recognizer_md, pipeline, ru]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: ru
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The entity_recognizer_md is a pretrained pipeline that can be used to process text with a simple pipeline that performs basic processing steps.
It performs most of the common text processing tasks on your dataframe.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_ru_3.0.0_3.0_1616448672830.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_ru_3.0.0_3.0_1616448672830.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('entity_recognizer_md', lang = 'ru')
annotations = pipeline.fullAnnotate("Здравствуйте из Джона Снежных Лабораторий! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("entity_recognizer_md", lang = "ru")
val result = pipeline.fullAnnotate("Здравствуйте из Джона Снежных Лабораторий! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Здравствуйте из Джона Снежных Лабораторий! "]
result_df = nlu.load('ru.ner.md').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | embeddings | ner | entities |
|---:|:------------------------------------------------|:-----------------------------------------------|:-----------------------------------------------------------|:-----------------------------|:--------------------------------------|:-------------------------------|
| 0 | ['Здравствуйте из Джона Снежных Лабораторий! '] | ['Здравствуйте из Джона Снежных Лабораторий!'] | ['Здравствуйте', 'из', 'Джона', 'Снежных', 'Лабораторий!'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-LOC', 'I-LOC', 'I-LOC'] | ['Джона Снежных Лабораторий!'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|entity_recognizer_md|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|ru|
---
layout: model
title: RE Pipeline between Body Parts and Direction Entities
author: John Snow Labs
name: re_bodypart_directions_pipeline
date: 2023-06-13
tags: [licensed, clinical, relation_extraction, body_part, directions, en]
task: Relation Extraction
language: en
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [re_bodypart_directions](https://nlp.johnsnowlabs.com/2021/01/18/re_bodypart_directions_en.html) model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_bodypart_directions_pipeline_en_4.4.4_3.2_1686664392280.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_bodypart_directions_pipeline_en_4.4.4_3.2_1686664392280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("re_bodypart_directions_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("re_bodypart_directions_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.bodypart_directions.pipeline").predict("""MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia""")
```
## Results
```bash
| index | relations | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|-------|-----------|-----------------------------|---------------|-------------|------------|-----------------------------|---------------|-------------|---------------|------------|
| 0 | 1 | Direction | 35 | 39 | upper | Internal_organ_or_component | 41 | 50 | brain stem | 0.9999989 |
| 1 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 59 | 68 | cerebellum | 0.99992585 |
| 2 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.9999999 |
| 3 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 54 | 57 | left | 0.999811 |
| 4 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 75 | 79 | right | 0.9998203 |
| 5 | 1 | Direction | 54 | 57 | left | Internal_organ_or_component | 59 | 68 | cerebellum | 1.0 |
| 6 | 0 | Direction | 54 | 57 | left | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.97616416 |
| 7 | 0 | Internal_organ_or_component | 59 | 68 | cerebellum | Direction | 75 | 79 | right | 0.953046 |
| 8 | 1 | Direction | 75 | 79 | right | Internal_organ_or_component | 81 | 93 | basil ganglia | 1.0 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|re_bodypart_directions_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- PerceptronModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- DependencyParserModel
- RelationExtractionModel
---
layout: model
title: Icelandic DistilBertForTokenClassification Cased model (from m3hrdadfi)
author: John Snow Labs
name: distilbert_token_classifier_typo_detector
date: 2023-03-06
tags: [is, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: is
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `typo-detector-distilbert-is` is an Icelandic model originally trained by `m3hrdadfi`.
## Predicted Entities
`TYPO`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/dtilbert_token_classifier_typo_detector_is_4.3.1_3.0_1678134296845.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/dtilbert_token_classifier_typo_detector_is_4.3.1_3.0_1678134296845.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("dtilbert_token_classifier_typo_detector","is") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("dtilbert_token_classifier_typo_detector","is")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|dtilbert_token_classifier_typo_detector|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|is|
|Size:|505.7 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/m3hrdadfi/typo-detector-distilbert-is
- https://github.com/m3hrdadfi/typo-detector/issues
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab11_by_sameearif88 TFWav2Vec2ForCTC from sameearif88
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab11_by_sameearif88
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab11_by_sameearif88` is an English model originally trained by sameearif88.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab11_by_sameearif88_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab11_by_sameearif88_en_4.2.0_3.0_1664021315565.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab11_by_sameearif88_en_4.2.0_3.0_1664021315565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab11_by_sameearif88', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab11_by_sameearif88", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab11_by_sameearif88|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from Dongjae)
author: John Snow Labs
name: xlm_roberta_qa_mrc2reader
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mrc2reader` is an English model originally trained by `Dongjae`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_mrc2reader_en_4.0.0_3.0_1655987882933.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_mrc2reader_en_4.0.0_3.0_1655987882933.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_mrc2reader","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_mrc2reader","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xlm_roberta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_mrc2reader|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.9 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Dongjae/mrc2reader
---
layout: model
title: Legal Dispute Resolve Clause Binary Classifier
author: John Snow Labs
name: legclf_dispute_resol_clause
date: 2023-02-13
tags: [en, legal, classification, dispute, resolve, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `dispute_resol` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to perform binary classification at the sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`dispute_resol`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_dispute_resol_clause_en_1.0.0_3.0_1676303229805.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_dispute_resol_clause_en_1.0.0_3.0_1676303229805.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
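No code snippet was provided for this model. A minimal sketch of a typical Legal NLP document-classification pipeline follows; the embeddings stage (`sent_bert_base_cased`) and the `nlp`/`legal` module aliases from the `johnsnowlabs` library are assumptions, so verify them against the model's training details before use.

```python
# Sketch only: assumes `from johnsnowlabs import nlp, legal` and a running
# licensed Spark session. The sentence-embeddings model used here is an
# assumption, not confirmed by this card.
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Produces the `sentence_embeddings` input the classifier expects.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_dispute_resol_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[documentAssembler, embeddings, docClassifier])
df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```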
## Results
```bash
+---------------+
|result         |
+---------------+
|[dispute_resol]|
|[other]        |
|[other]        |
|[dispute_resol]|
+---------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_dispute_resol_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
dispute_resol 0.94 0.89 0.92 19
other 0.83 0.91 0.87 11
accuracy - - 0.90 30
macro-avg 0.89 0.90 0.89 30
weighted-avg 0.90 0.90 0.90 30
```
---
layout: model
title: RE Pipeline between Problem, Test, and Findings in Reports
author: John Snow Labs
name: re_test_problem_finding_pipeline
date: 2023-06-13
tags: [licensed, clinical, relation_extraction, problem, test, findings, en]
task: Relation Extraction
language: en
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [re_test_problem_finding](https://nlp.johnsnowlabs.com/2021/04/19/re_test_problem_finding_en.html) model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_test_problem_finding_pipeline_en_4.4.4_3.2_1686665114725.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_test_problem_finding_pipeline_en_4.4.4_3.2_1686665114725.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("re_test_problem_finding_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("Targeted biopsy of this lesion for histological correlation should be considered.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("re_test_problem_finding_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("Targeted biopsy of this lesion for histological correlation should be considered.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.test_problem_finding.pipeline").predict("""Targeted biopsy of this lesion for histological correlation should be considered.""")
```
## Results
```bash
| index | relations | entity1 | chunk1 | entity2 | chunk2 |
|-------|-----------|-----------|--------|---------|--------|
| 0 | 1 | PROCEDURE | biopsy | SYMPTOM | lesion |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|re_test_problem_finding_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- PerceptronModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- DependencyParserModel
- RelationExtractionModel
---
layout: model
title: Pipeline to Detect Clinical Entities (jsl_ner_wip_clinical)
author: John Snow Labs
name: jsl_ner_wip_clinical_pipeline
date: 2023-03-15
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [jsl_ner_wip_clinical](https://nlp.johnsnowlabs.com/2021/03/31/jsl_ner_wip_clinical_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_clinical_pipeline_en_4.3.0_3.2_1678875196882.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_clinical_pipeline_en_4.3.0_3.2_1678875196882.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("jsl_ner_wip_clinical_pipeline", "en", "clinical/models")
text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("jsl_ner_wip_clinical_pipeline", "en", "clinical/models")
val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.jsl_wip_clinical.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:------------------------------------------|--------:|------:|:-----------------------------|-------------:|
| 0 | 21-day-old | 17 | 26 | Age | 0.9984 |
| 1 | Caucasian | 28 | 36 | Race_Ethnicity | 1 |
| 2 | male | 38 | 41 | Gender | 0.9986 |
| 3 | for 2 days | 48 | 57 | Duration | 0.678133 |
| 4 | congestion | 62 | 71 | Symptom | 0.9693 |
| 5 | mom | 75 | 77 | Gender | 0.7091 |
| 6 | yellow | 99 | 104 | Modifier | 0.667 |
| 7 | discharge | 106 | 114 | Symptom | 0.3037 |
| 8 | nares | 135 | 139 | External_body_part_or_region | 0.89 |
| 9 | she | 147 | 149 | Gender | 0.9992 |
| 10 | mild | 168 | 171 | Modifier | 0.8106 |
| 11 | problems with his breathing while feeding | 173 | 213 | Symptom | 0.500483 |
| 12 | perioral cyanosis | 237 | 253 | Symptom | 0.54895 |
| 13 | retractions | 258 | 268 | Symptom | 0.9847 |
| 14 | One day ago | 272 | 282 | RelativeDate | 0.550167 |
| 15 | mom | 285 | 287 | Gender | 0.573 |
| 16 | Tylenol | 345 | 351 | Drug_BrandName | 0.9958 |
| 17 | Baby | 354 | 357 | Age | 0.9989 |
| 18 | decreased p.o. intake | 377 | 397 | Symptom | 0.22495 |
| 19 | His | 400 | 402 | Gender | 0.9997 |
| 20 | 20 minutes | 439 | 448 | Duration | 0.1453 |
| 21 | q.2h. to | 450 | 457 | Frequency | 0.413667 |
| 22 | 5 to 10 minutes | 459 | 473 | Duration | 0.152125 |
| 23 | his | 488 | 490 | Gender | 0.9987 |
| 24 | respiratory congestion | 492 | 513 | VS_Finding | 0.6458 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|jsl_ner_wip_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Ganda asr_wav2vec2_luganda_by_indonesian_nlp TFWav2Vec2ForCTC from indonesian-nlp
author: John Snow Labs
name: pipeline_asr_wav2vec2_luganda_by_indonesian_nlp
date: 2022-09-24
tags: [wav2vec2, lg, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: lg
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_luganda_by_indonesian_nlp` is a Ganda model originally trained by indonesian-nlp.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_luganda_by_indonesian_nlp_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_luganda_by_indonesian_nlp_lg_4.2.0_3.0_1664036315040.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_luganda_by_indonesian_nlp_lg_4.2.0_3.0_1664036315040.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_luganda_by_indonesian_nlp', lang = 'lg')
annotations = pipeline.transform(audioDF)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_luganda_by_indonesian_nlp", lang = "lg")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_luganda_by_indonesian_nlp|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|lg|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Legal Limitation Of Liability Clause Binary Classifier
author: John Snow Labs
name: legclf_limitation_of_liability_clause
date: 2022-12-18
tags: [en, legal, classification, licensed, clause, bert, limitation, of, liability, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the limitation-of-liability clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
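As a rough illustration of the first technique, paragraph splitting by multiline breaks can be done outside the pipeline with a few lines of plain Python (a hedged sketch; the helper name is ours and is not part of Spark NLP):

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on blank lines (multiline splitting)."""
    return [p.strip() for p in re.split(r"\n\s*\n+", text) if p.strip()]

contract = (
    "1. LIMITATION OF LIABILITY. In no event shall either party be liable...\n"
    "\n"
    "2. GOVERNING LAW. This Agreement shall be governed by...\n"
)
# Each resulting paragraph can then be sent to the classifier as its own row.
paragraphs = split_paragraphs(contract)
```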
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in the Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`limitation-of-liability`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_limitation_of_liability_clause_en_1.0.0_3.0_1671393635939.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_limitation_of_liability_clause_en_1.0.0_3.0_1671393635939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
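This card omits a usage snippet. Below is a minimal sketch in the style of the other cards; the `sent_bert_base_cased` embeddings model and the `johnsnowlabs` import path are assumptions for illustration (the model consumes `sentence_embeddings` and emits `class`), so adapt them to your environment:

```python
# Hedged sketch, not taken from this card: assumes Legal NLP is installed,
# a Spark session `spark` exists, and the embeddings model name is valid.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed sentence-embeddings stage feeding the classifier's sentence_embeddings input
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_limitation_of_liability_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```

Feed one clause or paragraph per row; the classifier returns `limitation-of-liability` or `other` for each.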
## Results
```bash
+-------------------------+
|result                   |
+-------------------------+
|[limitation-of-liability]|
|[other]                  |
|[other]                  |
|[limitation-of-liability]|
+-------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_limitation_of_liability_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents.
## Benchmarking
```bash
label precision recall f1-score support
limitation-of-liability 0.93 0.90 0.91 29
other 0.93 0.95 0.94 39
accuracy - - 0.93 68
macro-avg 0.93 0.92 0.92 68
weighted-avg 0.93 0.93 0.93 68
```
---
layout: model
title: Fast Neural Machine Translation Model from English to Irish
author: John Snow Labs
name: opus_mt_en_ga
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, ga, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `en`
- target languages: `ga`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ga_xx_2.7.0_2.4_1609170911340.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ga_xx_2.7.0_2.4_1609170911340.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import SentenceDetectorDLModel, MarianTransformer
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_ga", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_ga", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ga').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ga|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal Introductory Clause Binary Classifier
author: John Snow Labs
name: legclf_introduction_clause
date: 2022-11-17
tags: [introduction, parties, document, en, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the first introductory clause, where the Document Type and the Parties are mentioned. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in the Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`introduction`, `other`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_introduction_clause_en_1.0.0_3.0_1668680203953.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_introduction_clause_en_1.0.0_3.0_1668680203953.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
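This card omits a usage snippet. A minimal sketch in the style of the other cards follows; the `sent_bert_base_cased` embeddings model and the `johnsnowlabs` import path are assumptions for illustration (the model consumes `sentence_embeddings` and emits `class`), so adapt them to your environment:

```python
# Hedged sketch, not taken from this card: assumes Legal NLP is installed,
# a Spark session `spark` exists, and the embeddings model name is valid.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed sentence-embeddings stage feeding the classifier's sentence_embeddings input
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_introduction_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```

Feed one clause or paragraph per row; the classifier returns `introduction` or `other` for each.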
## Results
```bash
+--------------+
|result        |
+--------------+
|[introduction]|
|[other]       |
|[other]       |
|[introduction]|
+--------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_introduction_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|23.2 MB|
## References
Legal documents scraped from the Internet and classified in-house, including the CUAD dataset.
## Benchmarking
```bash
label precision recall f1-score support
introduction 1.00 0.98 0.99 99
other 0.99 1.00 0.99 151
accuracy - - 0.99 250
macro-avg 0.99 0.99 0.99 250
weighted-avg 0.99 0.99 0.99 250
```
---
layout: model
title: Legal Indemnifications Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_indemnifications_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, indemnifications, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Indemnifications` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in the Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Indemnifications`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indemnifications_bert_en_1.0.0_3.0_1678050524998.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indemnifications_bert_en_1.0.0_3.0_1678050524998.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
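This card omits a usage snippet. A minimal sketch in the style of the other cards follows; the `sent_bert_base_cased` embeddings model and the `johnsnowlabs` import path are assumptions for illustration (the model consumes `sentence_embeddings` and emits `class`), so adapt them to your environment:

```python
# Hedged sketch, not taken from this card: assumes Legal NLP is installed,
# a Spark session `spark` exists, and the embeddings model name is valid.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed sentence-embeddings stage feeding the classifier's sentence_embeddings input
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_indemnifications_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```

Feed one provision per row; the classifier returns `Indemnifications` or `Other` for each.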
## Results
```bash
+------------------+
|result            |
+------------------+
|[Indemnifications]|
|[Other]           |
|[Other]           |
|[Indemnifications]|
+------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_indemnifications_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Indemnifications 0.95 0.89 0.92 106
Other 0.92 0.96 0.94 141
accuracy - - 0.93 247
macro-avg 0.93 0.93 0.93 247
weighted-avg 0.93 0.93 0.93 247
```
---
layout: model
title: English image_classifier_vit_base_beans_demo ViTForImageClassification from nateraw
author: John Snow Labs
name: image_classifier_vit_base_beans_demo
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_beans_demo` is an English model originally trained by nateraw.
## Predicted Entities
`angular_leaf_spot`, `bean_rust`, `healthy`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_beans_demo_en_4.1.0_3.0_1660168105525.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_beans_demo_en_4.1.0_3.0_1660168105525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_base_beans_demo", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_base_beans_demo", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_base_beans_demo|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Sentence Entity Resolver for Snomed Concepts, INT version (``sbiobert_base_cased_mli`` embeddings)
author: John Snow Labs
name: sbiobertresolve_snomed_findings_int
date: 2021-05-16
tags: [entity_resolution, clinical, licensed, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.4
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to SNOMED codes (INT version) using `sbiobert_base_cased_mli` Sentence BERT embeddings. It has a faster load time, with a speedup of about 6x compared to previous versions. The load process is also more memory friendly: the maximum memory required during load is smaller, reducing the chance of OOM exceptions and thus relaxing hardware requirements.
## Predicted Entities
Predicts Snomed Codes and their normalized definition for each chunk.
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_findings_int_en_3.0.4_3.0_1621189624936.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_findings_int_en_3.0.4_3.0_1621189624936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
chunk2doc = Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
snomed_int_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_findings_int","en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_int_resolver])
data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val chunk2doc = new Chunk2Doc()
.setInputCols("ner_chunk")
.setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols("ner_chunk_doc")
.setOutputCol("sbert_embeddings")
val snomed_int_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_findings_int","en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_int_resolver))
val data = Seq("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.snomed.findings_int").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""")
```
## Results
```bash
+--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+
| chunk|begin|end| entity| code|confidence| resolutions| codes|
+--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+
| hypertension| 68| 79| PROBLEM| 266285003| 0.8867|rheumatic myocard...|266285003:::15529...|
|chronic renal ins...| 83|109| PROBLEM| 236425005| 0.2470|chronic renal imp...|236425005:::90688...|
| COPD| 113|116| PROBLEM| 413839001| 0.0720|chronic lung dise...|413839001:::41384...|
| gastritis| 120|128| PROBLEM| 266502003| 0.3240|acute peptic ulce...|266502003:::45560...|
| TIA| 136|138| PROBLEM|353101000119105| 0.0727|prostatic intraep...|353101000119105::...|
|a non-ST elevatio...| 182|202| PROBLEM| 233843008| 0.2846|silent myocardial...|233843008:::71942...|
|Guaiac positive s...| 208|229| PROBLEM| 168319009| 0.1167|stool culture pos...|168319009:::70396...|
|cardiac catheteri...| 295|317| TEST| 301095005| 0.2137|cardiac finding::...|301095005:::25090...|
| PTCA| 324|327|TREATMENT|842741000000109| 0.0631|occlusion of post...|842741000000109::...|
| mid LAD lesion| 332|345| PROBLEM| 449567000| 0.0808|overriding left v...|449567000:::25342...|
+--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_snomed_findings_int|
|Compatibility:|Healthcare NLP 3.0.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk, sbert_embeddings]|
|Output Labels:|[snomed_int_code]|
|Language:|en|
|Case sensitive:|false|
## Data Source
Trained on SNOMED (INT version) Findings with ``sbiobert_base_cased_mli`` sentence embeddings.
http://www.snomed.org/
---
layout: model
title: English DistilBertForQuestionAnswering model (from holtin) Squad2
author: John Snow Labs
name: distilbert_qa_base_uncased_holtin_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-holtin-finetuned-squad` is an English model originally trained by `holtin`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_holtin_finetuned_squad_en_4.0.0_3.0_1654727128971.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_holtin_finetuned_squad_en_4.0.0_3.0_1654727128971.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_holtin_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_holtin_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_holtin").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_holtin_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/holtin/distilbert-base-uncased-holtin-finetuned-squad
---
layout: model
title: Legal Annual Bonus Clause Binary Classifier
author: John Snow Labs
name: legclf_annual_bonus_clause
date: 2023-01-29
tags: [en, legal, classification, annual, bonus, clauses, annual_bonus, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True/False) for the `annual-bonus` clause type. To use it, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting them using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
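As a plain-Python illustration of the first technique, paragraph splitting by multiline can be done with a blank-line split before the text ever reaches the classifier. The `split_paragraphs` helper below is hypothetical, not part of any Spark NLP API:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on blank lines (multiline splitting)."""
    # One or more blank lines (possibly containing spaces) marks a boundary.
    parts = re.split(r"\n\s*\n", text)
    # Drop empty fragments and collapse internal whitespace.
    return [" ".join(p.split()) for p in parts if p.strip()]

doc = """ANNUAL BONUS. The Executive shall be eligible for an annual bonus.

GOVERNING LAW. This Agreement shall be governed by Delaware law."""
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 2
```

Each resulting paragraph can then be fed to the classifier as a separate row, keeping every input well under the 512-token limit mentioned below.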
Take into consideration that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`annual-bonus`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_annual_bonus_clause_en_1.0.0_3.0_1675005760165.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_annual_bonus_clause_en_1.0.0_3.0_1675005760165.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
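A minimal usage sketch, following the pattern used by other Legal NLP clause classifiers in Models Hub. Treat it as an assumption rather than a verified snippet: it requires the licensed `johnsnowlabs` library, and the sentence-embeddings model `sent_bert_base_cased` is the one commonly paired with these classifiers, not one confirmed by this card:

```python
# Sketch of a classification pipeline for legclf_annual_bonus_clause.
# Assumes `nlp`, `legal`, and a running `spark` session from the licensed
# johnsnowlabs library; the embeddings model name is an assumption.
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_annual_bonus_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```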
## Results
```bash
+-------+
|result|
+-------+
|[annual-bonus]|
|[other]|
|[other]|
|[annual-bonus]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_annual_bonus_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents, scraped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
annual-bonus 1.00 0.94 0.97 33
other 0.95 1.00 0.97 39
accuracy - - 0.97 72
macro-avg 0.98 0.97 0.97 72
weighted-avg 0.97 0.97 0.97 72
```
---
layout: model
title: Pipeline to Detect Clinical Entities (WIP Greedy)
author: John Snow Labs
name: jsl_ner_wip_greedy_biobert_pipeline
date: 2022-03-21
tags: [licensed, ner, wip, biobert, greedy, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [jsl_ner_wip_greedy_biobert](https://nlp.johnsnowlabs.com/2021/07/26/jsl_ner_wip_greedy_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_biobert_pipeline_en_3.4.1_3.0_1647866004113.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_biobert_pipeline_en_3.4.1_3.0_1647866004113.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("jsl_ner_wip_greedy_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.")
```
```scala
val pipeline = new PretrainedPipeline("jsl_ner_wip_greedy_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.greedy_wip_biobert.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
## Results
```bash
| | chunk | entity |
|---:|:-----------------------------------------------|:-----------------------------|
| 0 | 21-day-old | Age |
| 1 | Caucasian | Race_Ethnicity |
| 2 | male | Gender |
| 3 | for 2 days | Duration |
| 4 | congestion | Symptom |
| 5 | mom | Gender |
| 6 | suctioning yellow discharge | Symptom |
| 7 | nares | External_body_part_or_region |
| 8 | she | Gender |
| 9 | mild problems with his breathing while feeding | Symptom |
| 10 | perioral cyanosis | Symptom |
| 11 | retractions | Symptom |
| 12 | One day ago | RelativeDate |
| 13 | mom | Gender |
| 14 | tactile temperature | Symptom |
| 15 | Tylenol | Drug |
| 16 | Baby | Age |
| 17 | decreased p.o. intake | Symptom |
| 18 | His | Gender |
| 19 | breast-feeding | External_body_part_or_region |
| 20 | q.2h | Frequency |
| 21 | to 5 to 10 minutes | Duration |
| 22 | his | Gender |
| 23 | respiratory congestion | Symptom |
| 24 | He | Gender |
| 25 | tired | Symptom |
| 26 | fussy | Symptom |
| 27 | over the past 2 days | RelativeDate |
| 28 | albuterol | Drug |
| 29 | ER | Clinical_Dept |
| 30 | His | Gender |
| 31 | urine output has also decreased | Symptom |
| 32 | he | Gender |
| 33 | per 24 hours | Frequency |
| 34 | he | Gender |
| 35 | per 24 hours | Frequency |
| 36 | Mom | Gender |
| 37 | diarrhea | Symptom |
| 38 | His | Gender |
| 39 | bowel | Internal_organ_or_component |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|jsl_ner_wip_greedy_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.7 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverter
---
layout: model
title: Legal Alias Pipeline
author: John Snow Labs
name: legpipe_alias
date: 2023-04-30
tags: [en, legal, ner, pipeline, alias, licensed]
task: Pipeline Legal
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline allows you to detect names in quotes and brackets, such as ("Supplier"), ("Recipient"), or ("Disclosing Parties"), which are very common in legal agreements to reference the parties.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legpipe_alias_en_1.0.0_3.0_1682861474127.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legpipe_alias_en_1.0.0_3.0_1682861474127.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
legal_pipeline = nlp.PretrainedPipeline("legpipe_alias", "en", "legal/models")
text = ["""MUTUAL NON-DISCLOSURE AGREEMENT
This Mutual Non-Disclosure Agreement (the “Agreement”) is made on _________ by and between:
John Snow Labs, a Delaware corporation, registered at 16192 Coastal Highway, Lewes, Delaware 19958 (“John Snow Labs”), and
Acentos, S.L, a Spanish corporation, registered at Gran Via 71, 2º floor (“Company”), (each a “party” and together the “parties”).
Recitals:
John Snow Labs and Company intend to explore the possibility of a business relationship between each other, whereby each party (“Discloser”) may disclose sensitive information to the other party (“Recipient”).
The parties agree as follows:"""]
result = legal_pipeline.annotate(text)
```
## Results
```bash
['(“John Snow Labs”)', '(“Company”)', '( “ Discloser ” )', '(“Recipient”)']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legpipe_alias|
|Type:|pipeline|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|13.1 KB|
## Included Models
- DocumentAssembler
- TokenizerModel
- ContextualParserModel
---
layout: model
title: Swahili XLMRobertaForTokenClassification Base Cased model (from mbeukman)
author: John Snow Labs
name: xlmroberta_ner_base_finetuned_yoruba_finetuned_ner_swahili
date: 2022-08-01
tags: [sw, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: sw
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-yoruba-finetuned-ner-swahili` is a Swahili model originally trained by `mbeukman`.
## Predicted Entities
`DATE`, `PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_yoruba_finetuned_ner_swahili_sw_4.1.0_3.0_1659356236533.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_yoruba_finetuned_ner_swahili_sw_4.1.0_3.0_1659356236533.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_yoruba_finetuned_ner_swahili","sw") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_yoruba_finetuned_ner_swahili","sw")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_finetuned_yoruba_finetuned_ner_swahili|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|sw|
|Size:|1.0 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-yoruba-finetuned-ner-swahili
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://github.com/masakhane-io/masakhane-ner
---
layout: model
title: Finnish asr_wav2vec2_xlsr_1b_finnish TFWav2Vec2ForCTC from aapot
author: John Snow Labs
name: asr_wav2vec2_xlsr_1b_finnish
date: 2022-09-24
tags: [wav2vec2, fi, audio, open_source, asr]
task: Automatic Speech Recognition
language: fi
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_1b_finnish` is a Finnish model originally trained by aapot.
NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_xlsr_1b_finnish_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_1b_finnish_fi_4.2.0_3.0_1664018554548.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_1b_finnish_fi_4.2.0_3.0_1664018554548.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_xlsr_1b_finnish", "fi")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_xlsr_1b_finnish", "fi")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_xlsr_1b_finnish|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|fi|
|Size:|3.6 GB|
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from monakth)
author: John Snow Labs
name: distilbert_qa_base_uncased_fine_tuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distillbert-base-uncased-fine-tuned-squad` is an English model originally trained by `monakth`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_fine_tuned_squad_en_4.3.0_3.0_1672774847284.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_fine_tuned_squad_en_4.3.0_3.0_1672774847284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_fine_tuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_fine_tuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_modeversion1_m6_e4", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_modeversion1_m6_e4", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_modeversion1_m6_e4|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|322.3 MB|
---
layout: model
title: Malay ALBERT Embeddings (Large)
author: John Snow Labs
name: albert_embeddings_albert_large_bahasa_cased
date: 2022-04-14
tags: [albert, embeddings, ms, open_source]
task: Embeddings
language: ms
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: AlBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-large-bahasa-cased` is a Malay model originally trained by `malay-huggingface`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_large_bahasa_cased_ms_3.4.2_3.0_1649954345847.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_large_bahasa_cased_ms_3.4.2_3.0_1649954345847.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_large_bahasa_cased","ms") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Saya suka Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_large_bahasa_cased","ms")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Saya suka Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ms.embed.albert").predict("""Saya suka Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_embeddings_albert_large_bahasa_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ms|
|Size:|68.8 MB|
|Case sensitive:|false|
## References
- https://huggingface.co/malay-huggingface/albert-large-bahasa-cased
- https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
- https://github.com/huseinzol05/malay-dataset/tree/master/corpus/pile
- https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/albert
---
layout: model
title: Japanese BertForMaskedLM Base Cased model (from cl-tohoku)
author: John Snow Labs
name: bert_embeddings_base_japanese_v2
date: 2022-12-02
tags: [ja, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: ja
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-japanese-v2` is a Japanese model originally trained by `cl-tohoku`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_japanese_v2_ja_4.2.4_3.0_1670018249307.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_japanese_v2_ja_4.2.4_3.0_1670018249307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_japanese_v2","ja") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_japanese_v2","ja")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_japanese_v2|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ja|
|Size:|417.3 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/cl-tohoku/bert-base-japanese-v2
- https://github.com/google-research/bert
- https://pypi.org/project/unidic-lite/
- https://github.com/cl-tohoku/bert-japanese/tree/v2.0
- https://taku910.github.io/mecab/
- https://github.com/neologd/mecab-ipadic-neologd
- https://github.com/polm/fugashi
- https://github.com/polm/unidic-lite
- https://www.tensorflow.org/tfrc/
- https://creativecommons.org/licenses/by-sa/3.0/
---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_4
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-64-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_4_en_4.0.0_3.0_1654191739285.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_4_en_4.0.0_3.0_1654191739285.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_4","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_4","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.span_bert.base_cased_64d_seed_4").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_4|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|378.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-64-finetuned-squad-seed-4
---
layout: model
title: Translate Welsh to English Pipeline
author: John Snow Labs
name: translate_cy_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, cy, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `cy`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_cy_en_xx_2.7.0_2.4_1609689849644.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_cy_en_xx_2.7.0_2.4_1609689849644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_cy_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_cy_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.cy.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_cy_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_4
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-256-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_4_en_4.0.0_3.0_1657184760188.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_4_en_4.0.0_3.0_1657184760188.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_4","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(False)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_4","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(false)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_4|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-256-finetuned-squad-seed-4
---
layout: model
title: Legal Duration and termination Clause Binary Classifier
author: John Snow Labs
name: legclf_duration_and_termination_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `duration-and-termination` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
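The paragraph-splitting idea above can be sketched in plain Python. This is a minimal illustration of splitting by multiline breaks (a hypothetical helper, not the workshop code):

```python
import re

def split_paragraphs(text):
    """Split a long document into paragraph-sized chunks on blank lines,
    so each chunk stays within the classifier's 512-token budget."""
    return [p.strip() for p in re.split(r"\n\s*\n+", text) if p.strip()]

contract = (
    "1. TERM. This Agreement begins on the Effective Date.\n\n"
    "2. TERMINATION. Either party may terminate upon 30 days notice.\n\n"
    "3. NOTICES. All notices shall be in writing."
)
chunks = split_paragraphs(contract)
# len(chunks) -> 3; each chunk is then classified independently
```

Each resulting chunk can be fed to the classifier as its own row, yielding one True/False prediction per paragraph.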
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `duration-and-termination`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_duration_and_termination_clause_en_1.0.0_3.2_1660122389421.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_duration_and_termination_clause_en_1.0.0_3.2_1660122389421.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
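A minimal sketch of a usage pipeline, following the pattern of other Legal NLP clause classifiers; the `sent_bert_base_cased` embeddings name and the `legal/models` location are assumptions based on similar `legclf_*` cards, and running it requires a licensed Legal NLP installation:

```python
def build_legclf_pipeline():
    """Sketch of a binary clause-classification pipeline.

    Assumptions: the sentence-embeddings model name and the
    "legal/models" location follow other legclf_* cards; a
    licensed Legal NLP installation is required to run this.
    """
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import BertSentenceEmbeddings, ClassifierDLModel

    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
        .setInputCols(["document"]) \
        .setOutputCol("sentence_embeddings")

    clf = ClassifierDLModel.pretrained("legclf_duration_and_termination_clause", "en", "legal/models") \
        .setInputCols(["sentence_embeddings"]) \
        .setOutputCol("category")

    return Pipeline(stages=[document_assembler, embeddings, clf])
```

Fit the returned pipeline on a DataFrame with a `text` column and inspect `category.result`, which contains `other` or `duration-and-termination`.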
## Results
```bash
+--------------------------+
|result                    |
+--------------------------+
|[duration-and-termination]|
|[other]                   |
|[other]                   |
|[duration-and-termination]|
+--------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_duration_and_termination_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
duration-and-termination 0.93 0.89 0.91 28
other 0.97 0.98 0.98 107
accuracy - - 0.96 135
macro-avg 0.95 0.94 0.94 135
weighted-avg 0.96 0.96 0.96 135
```
---
layout: model
title: Detect Living Species
author: John Snow Labs
name: ner_living_species
date: 2022-06-22
tags: [en, ner, clinical, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Extract living species from clinical texts, a task critical to scientific disciplines such as medicine, biology, ecology/biodiversity, nutrition, and agriculture.
It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others.
**NOTE :**
1. The text files were translated from Spanish with a neural machine translation system.
2. The annotations were translated with the same neural machine translation system.
3. The translated annotations were transferred to the translated text files using an annotation transfer technology.
## Predicted Entities
`HUMAN`, `SPECIES`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_en_3.5.3_3.0_1655888659088.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_en_3.5.3_3.0_1655888659088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
.setInputCols("sentence","token")\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_living_species", "en","clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter
])
data = spark.createDataFrame([["""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_living_species", "en","clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter))
val data = Seq("""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.living_species").predict("""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL.""")
```
## Results
```bash
+-----------------------+-------+
|ner_chunk |label |
+-----------------------+-------+
|woman |HUMAN |
|bacterial |SPECIES|
|Fusarium spp |SPECIES|
|patient |HUMAN |
|species |SPECIES|
|Fusarium solani complex|SPECIES|
|antifungals |SPECIES|
+-----------------------+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_living_species|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|15.1 MB|
## References
[https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/)
## Benchmarking
```bash
label precision recall f1-score support
B-HUMAN 0.84 0.96 0.90 2950
B-SPECIES 0.73 0.92 0.81 3129
I-HUMAN 0.69 0.68 0.69 145
I-SPECIES 0.66 0.89 0.76 1166
micro-avg 0.76 0.93 0.83 7390
macro-avg 0.73 0.86 0.79 7390
weighted-avg 0.76 0.93 0.83 7390
```
---
layout: model
title: Pipeline to Detect PHI in text (ner_deid_sd_large)
author: John Snow Labs
name: ner_deid_sd_large_pipeline
date: 2023-03-13
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_deid_sd_large](https://nlp.johnsnowlabs.com/2021/04/01/ner_deid_sd_large_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_sd_large_pipeline_en_4.3.0_3.2_1678733016225.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_sd_large_pipeline_en_4.3.0_3.2_1678733016225.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_deid_sd_large_pipeline", "en", "clinical/models")
text = '''Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_deid_sd_large_pipeline", "en", "clinical/models")
val text = "Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.deid.med_ner_large.pipeline").predict("""Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.""")
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:------------------------------|--------:|------:|:------------|-------------:|
| 0 | 2093-01-13 | 14 | 23 | DATE | 0.9999 |
| 1 | David Hale | 27 | 36 | NAME | 0.90085 |
| 2 | Hendrickson Ora | 55 | 69 | NAME | 0.94935 |
| 3 | 7194334 | 78 | 84 | ID | 0.9988 |
| 4 | 01/13/93 | 93 | 100 | DATE | 0.9913 |
| 5 | Oliveira | 110 | 117 | NAME | 0.9924 |
| 6 | 25 | 121 | 122 | AGE | 0.987 |
| 7 | 2079-11-09 | 150 | 159 | DATE | 0.9952 |
| 8 | Cocke County Baptist Hospital | 163 | 191 | LOCATION | 0.795975 |
| 9 | 0295 Keats Street | 195 | 211 | LOCATION | 0.741567 |
| 10 | 302-786-5227 | 221 | 232 | CONTACT | 0.984 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_sd_large_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_6_h_256
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-6_H-256` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_256_zh_4.2.4_3.0_1670325978179.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_256_zh_4.2.4_3.0_1670325978179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_256","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_256","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_6_h_256|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|39.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-6_H-256
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://cloud.tencent.com/
---
layout: model
title: English RobertaForSequenceClassification Base Cased model (from mrm8488)
author: John Snow Labs
name: roberta_sequence_classifier_distilroberta_base_finetuned_suicide_depression
date: 2022-07-13
tags: [en, open_source, roberta, sequence_classification]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base-finetuned-suicide-depression` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_distilroberta_base_finetuned_suicide_depression_en_4.0.0_3.0_1657715865562.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_distilroberta_base_finetuned_suicide_depression_en_4.0.0_3.0_1657715865562.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_distilroberta_base_finetuned_suicide_depression","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_distilroberta_base_finetuned_suicide_depression","en")
.setInputCols(Array("document", "token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_sequence_classifier_distilroberta_base_finetuned_suicide_depression|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|309.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mrm8488/distilroberta-base-finetuned-suicide-depression
- https://github.com/ayaanzhaque/SDCNL
---
layout: model
title: German Electra Embeddings (from deepset)
author: John Snow Labs
name: electra_embeddings_gelectra_base_generator
date: 2022-05-17
tags: [de, open_source, electra, embeddings]
task: Embeddings
language: de
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gelectra-base-generator` is a German model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_gelectra_base_generator_de_3.4.4_3.0_1652786833144.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_gelectra_base_generator_de_3.4.4_3.0_1652786833144.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("electra_embeddings_gelectra_base_generator","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("electra_embeddings_gelectra_base_generator","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ich liebe Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_embeddings_gelectra_base_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|de|
|Size:|128.3 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/deepset/gelectra-base-generator
- https://arxiv.org/pdf/2010.10906.pdf
- https://deepset.ai/german-bert
- https://deepset.ai/germanquad
- https://github.com/deepset-ai/FARM
- https://github.com/deepset-ai/haystack/
- https://twitter.com/deepset_ai
- https://www.linkedin.com/company/deepset-ai/
- https://haystack.deepset.ai/community/join
- https://github.com/deepset-ai/haystack/discussions
- https://deepset.ai
- http://www.deepset.ai/jobs
---
layout: model
title: Detect Persons, Locations, Organizations and Misc Entities in English
author: gokhanturer
name: Ner_conll2003_100d
date: 2022-02-08
tags: [opern_source, ner, glove_100d, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.1.2
spark_version: 3.0
supported: false
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This NER model, trained with GloVe 100d word embeddings, annotates text to find entities such as the names of people, places, and organizations.
```python
nerdl_model = NerDLModel.pretrained("Ner_conll2003_100d", "en", "@gokhanturer")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
```
## Predicted Entities
`PER`, `LOC`, `ORG`, `MISC`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/drive/1KtA_K-7_xO0oxQ7DhJtU5RPHtGnMzK8z#scrollTo=BP1iPII8PTdb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/community.johnsnowlabs.com/gokhanturer/Ner_conll2003_100d_en_3.1.2_3.0_1644322842689.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://community.johnsnowlabs.com/gokhanturer/Ner_conll2003_100d_en_3.1.2_3.0_1644322842689.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
! pip install -q pyspark==3.1.2 spark-nlp
! pip install -q spark-nlp-display
import sparknlp
spark = sparknlp.start(gpu = True)
from sparknlp.base import *
from sparknlp.annotator import *
import pyspark.sql.functions as F
from sparknlp.training import CoNLL
print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)
spark
Spark NLP version 3.4.0
Apache Spark version: 3.1.2
# CoNLL data prep
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.train
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.testa
# Train data
with open ("eng.train") as f:
train_data = f.read()
print (train_data[:500])
-DOCSTART- -X- -X- O
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
BRUSSELS NNP B-NP B-LOC
1996-08-22 CD I-NP O
The DT B-NP O
European NNP I-NP B-ORG
Commission NNP I-NP I-ORG
said VBD B-VP O
on IN B-PP O
Thursday NNP B-NP O
it PRP B-NP O
disagreed VBD B-VP O
with IN B-PP O
German JJ B-NP B-MISC
advice NN I-NP O
to TO B-PP O
consumers NNS B-NP
train_data = CoNLL().readDataset(spark, 'eng.train')
train_data.show(3)
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| text| document| sentence| token| pos| label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[{document, 0, 47...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|
| Peter Blackburn|[{document, 0, 14...|[{document, 0, 14...|[{token, 0, 4, Pe...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|
| BRUSSELS 1996-08-22|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 7, BR...|[{pos, 0, 7, NNP,...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows
In [7]:
train_data.count()
Out[7]:
14041
In [8]:
train_data.select(F.explode(F.arrays_zip('token.result', 'pos.result', 'label.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
F.expr("cols['1']").alias("pos"),
F.expr("cols['2']").alias("ner_label")).show(truncate=50)
+----------+---+---------+
| token|pos|ner_label|
+----------+---+---------+
| EU|NNP| B-ORG|
| rejects|VBZ| O|
| German| JJ| B-MISC|
| call| NN| O|
| to| TO| O|
| boycott| VB| O|
| British| JJ| B-MISC|
| lamb| NN| O|
| .| .| O|
| Peter|NNP| B-PER|
| Blackburn|NNP| I-PER|
| BRUSSELS|NNP| B-LOC|
|1996-08-22| CD| O|
| The| DT| O|
| European|NNP| B-ORG|
|Commission|NNP| I-ORG|
| said|VBD| O|
| on| IN| O|
| Thursday|NNP| O|
| it|PRP| O|
+----------+---+---------+
only showing top 20 rows
In [9]:
train_data.select(F.explode(F.arrays_zip("token.result","label.result")).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
F.expr("cols['1']").alias("ground_truth")).groupBy("ground_truth").count().orderBy("count", ascending=False).show(100,truncate=False)
+------------+------+
|ground_truth|count |
+------------+------+
|O |169578|
|B-LOC |7140 |
|B-PER |6600 |
|B-ORG |6321 |
|I-PER |4528 |
|I-ORG |3704 |
|B-MISC |3438 |
|I-LOC |1157 |
|I-MISC |1155 |
+------------+------+
In [10]:
#conll_data.select(F.countDistinct("label.result")).show()
#conll_data.groupBy("label.result").count().show(truncate=False)
train_data = train_data.withColumn('unique', F.array_distinct("label.result"))\
.withColumn('c', F.size('unique'))\
.filter(F.col('c')>1)
train_data.select(F.explode(F.arrays_zip('token.result','label.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
F.expr("cols['1']").alias("ground_truth"))\
.groupBy('ground_truth')\
.count()\
.orderBy('count', ascending=False)\
.show(100,truncate=False)
+------------+------+
|ground_truth|count |
+------------+------+
|O |137736|
|B-LOC |7125 |
|B-PER |6596 |
|B-ORG |6288 |
|I-PER |4528 |
|I-ORG |3704 |
|B-MISC |3437 |
|I-LOC |1157 |
|I-MISC |1155 |
+------------+------+
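The cell above keeps only sentences whose label set contains something other than `O`, which is why the `O` count drops from 169,578 to 137,736 while the entity counts barely move. The same condition in plain Python, on hypothetical toy sentences rather than the actual CoNLL rows:

```python
# Minimal pure-Python sketch of the sentence filter above: keep only
# sentences whose label set contains more than just "O".
# The toy sentences are hypothetical, not taken from the CoNLL files.
sentences = [
    ["B-ORG", "O", "B-MISC", "O"],   # has entities -> kept
    ["O", "O", "O"],                 # all O -> dropped
    ["B-PER", "I-PER"],              # has entities -> kept
]

def keep_sentence(labels):
    # mirrors F.size(F.array_distinct("label.result")) > 1
    return len(set(labels)) > 1

filtered = [s for s in sentences if keep_sentence(s)]
print(len(filtered))  # 2
```

Note that the `size(...) > 1` test would also drop a sentence made of a single repeated entity label with no `O` tokens at all, though such sentences are vanishingly rare in CoNLL 2003.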
Test Data
In [11]:
with open("eng.testa") as f:
test_data = f.read()
print(test_data[:500])
-DOCSTART- -X- -X- O
CRICKET NNP B-NP O
- : O O
LEICESTERSHIRE NNP B-NP B-ORG
TAKE NNP I-NP O
OVER IN B-PP O
AT NNP B-NP O
TOP NNP I-NP O
AFTER NNP I-NP O
INNINGS NNP I-NP O
VICTORY NN I-NP O
. . O O
LONDON NNP B-NP B-LOC
1996-08-30 CD I-NP O
West NNP B-NP B-MISC
Indian NNP I-NP I-MISC
all-rounder NN I-NP O
Phil NNP I-NP B-PER
Simmons NNP I-NP I-PER
took VBD B-VP O
four CD B-NP O
for IN B-PP O
38 CD B-NP O
on IN B-PP O
Friday NNP B-NP O
as IN B-PP O
Leicestershire NNP B-NP B-ORG
beat VBD B-VP
In [12]:
test_data = CoNLL().readDataset(spark, 'eng.testa')
test_data.show(3)
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| text| document| sentence| token| pos| label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|CRICKET - LEICEST...|[{document, 0, 64...|[{document, 0, 64...|[{token, 0, 6, CR...|[{pos, 0, 6, NNP,...|[{named_entity, 0...|
| LONDON 1996-08-30|[{document, 0, 16...|[{document, 0, 16...|[{token, 0, 5, LO...|[{pos, 0, 5, NNP,...|[{named_entity, 0...|
|West Indian all-r...|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 3, We...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows
In [13]:
test_data.count()
Out[13]:
3250
In [14]:
test_data.select(F.explode(F.arrays_zip("token.result","label.result")).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
F.expr("cols['1']").alias("ground_truth")).groupBy("ground_truth").count().orderBy("count", ascending=False).show(100,truncate=False)
+------------+-----+
|ground_truth|count|
+------------+-----+
|O |42759|
|B-PER |1842 |
|B-LOC |1837 |
|B-ORG |1341 |
|I-PER |1307 |
|B-MISC |922 |
|I-ORG |751 |
|I-MISC |346 |
|I-LOC |257 |
+------------+-----+
NERDL Model with Glove_100d
In [15]:
glove_embeddings = WordEmbeddingsModel.pretrained()\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
In [16]:
glove_embeddings.transform(test_data).write.parquet('test_data_embeddings.parquet')
In [17]:
nerTagger = NerDLApproach()\
.setInputCols(["sentence", "token", "embeddings"])\
.setLabelColumn("label")\
.setOutputCol("ner")\
.setMaxEpochs(8)\
.setLr(0.002)\
.setDropout(0.5)\
.setBatchSize(16)\
.setRandomSeed(0)\
.setVerbose(1)\
.setEvaluationLogExtended(True) \
.setEnableOutputLogs(True)\
.setIncludeConfidence(True)\
.setTestDataset('test_data_embeddings.parquet')\
.setEnableMemoryOptimizer(False)
ner_pipeline = Pipeline(stages=[
glove_embeddings,
nerTagger
])
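No decay coefficient is set explicitly above, yet the training log further down shows the learning rate shrinking every epoch (0.002, 0.0019900498, 0.001980198, …). This matches inverse-time decay, lr / (1 + po · epoch), with what appears to be Spark NLP's default po = 0.005 (tunable via `setPo` on `NerDLApproach`); a quick check under that assumption:

```python
# Recompute the per-epoch learning rates printed in the training log,
# assuming inverse-time decay lr / (1 + po * epoch) with po = 0.005
# (the assumed NerDLApproach default, settable via setPo).
lr0, po = 0.002, 0.005
rates = [lr0 / (1 + po * epoch) for epoch in range(4)]
print([round(r, 10) for r in rates])
# [0.002, 0.0019900498, 0.001980198, 0.0019704433]
```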
In [19]:
%%time
ner_model = ner_pipeline.fit(train_data)
CPU times: user 10.6 s, sys: 1.08 s, total: 11.7 s
Wall time: 35min 21s
In [20]:
!cd ~/annotator_logs/ && ls -lt
total 16
-rw-r--r-- 1 root root 13178 Feb 6 17:05 NerDLApproach_c5bf4e4c6211.log
In [21]:
!cat ~/annotator_logs/NerDLApproach_c5bf4e4c6211.log
Name of the selected graph: ner-dl/blstm_10_100_128_120.pb
Training started - total epochs: 8 - lr: 0.002 - batch size: 16 - labels: 9 - chars: 82 - training examples: 11079
Epoch 1/8 started, lr: 0.002, dataset size: 11079
Epoch 1/8 - 159.93s - loss: 2234.436 - batches: 695
Quality on test dataset:
time to finish evaluation: 17.74s
label tp fp fn prec rec f1
B-LOC 1695 94 142 0.94745666 0.92270005 0.93491447
I-ORG 528 76 223 0.8741722 0.7030626 0.77933586
I-MISC 255 88 91 0.7434402 0.7369942 0.74020314
I-LOC 189 14 68 0.9310345 0.73540854 0.8217391
I-PER 1270 59 37 0.95560575 0.9716909 0.9635812
B-MISC 797 142 125 0.84877527 0.8644252 0.85652876
B-ORG 1139 170 202 0.8701299 0.8493661 0.85962266
B-PER 1802 176 40 0.91102123 0.9782845 0.94345546
tp: 7675 fp: 819 fn: 928 labels: 8
Macro-average prec: 0.8852045, rec: 0.84524155, f1: 0.86476153
Micro-average prec: 0.903579, rec: 0.8921307, f1: 0.8978184
Epoch 2/8 started, lr: 0.0019900498, dataset size: 11079
Epoch 2/8 - 246.28s - loss: 839.1736 - batches: 695
Quality on test dataset:
time to finish evaluation: 19.66s
label tp fp fn prec rec f1
B-LOC 1762 124 75 0.9342524 0.95917255 0.9465484
I-ORG 585 76 166 0.8850227 0.77896136 0.82861185
I-MISC 247 39 99 0.8636364 0.71387285 0.7816456
I-LOC 233 74 24 0.7589577 0.9066148 0.8262412
I-PER 1275 54 32 0.95936793 0.97551644 0.9673748
B-MISC 791 70 131 0.9186992 0.85791755 0.88726866
B-ORG 1150 151 191 0.88393545 0.857569 0.8705526
B-PER 1800 147 42 0.9244992 0.9771987 0.9501188
tp: 7843 fp: 735 fn: 760 labels: 8
Macro-average prec: 0.89104635, rec: 0.8783529, f1: 0.88465416
Micro-average prec: 0.9143157, rec: 0.9116587, f1: 0.9129852
Epoch 3/8 started, lr: 0.001980198, dataset size: 11079
Epoch 1/8 - 254.10s - loss: 2203.116 - batches: 695
Quality on test dataset:
time to finish evaluation: 22.30s
label tp fp fn prec rec f1
B-LOC 1660 82 177 0.95292765 0.90364724 0.9276334
I-ORG 560 123 191 0.81991214 0.74567246 0.781032
I-MISC 227 65 119 0.7773973 0.65606934 0.7115987
I-LOC 155 10 102 0.93939394 0.6031128 0.73459715
I-PER 1259 60 48 0.954511 0.96327466 0.9588728
B-MISC 762 110 160 0.8738532 0.82646424 0.8494984
B-ORG 1160 237 181 0.83035076 0.8650261 0.8473338
B-PER 1785 170 57 0.9130435 0.96905535 0.94021595
tp: 7568 fp: 857 fn: 1035 labels: 8
Macro-average prec: 0.88267374, rec: 0.81654024, f1: 0.84832007
Micro-average prec: 0.89827895, rec: 0.87969315, f1: 0.8888889
Epoch 2/8 started, lr: 0.0019900498, dataset size: 11079
Epoch 3/8 - 257.88s - loss: 610.81525 - batches: 695
Quality on test dataset:
time to finish evaluation: 18.07s
label tp fp fn prec rec f1
B-LOC 1764 104 73 0.9443255 0.9602613 0.9522267
I-ORG 640 140 111 0.82051283 0.85219705 0.8360548
I-MISC 227 22 119 0.9116466 0.65606934 0.7630252
I-LOC 223 43 34 0.8383459 0.8677043 0.8527725
I-PER 1265 31 42 0.97608024 0.96786535 0.9719554
B-MISC 785 62 137 0.9268005 0.85141 0.8875071
B-ORG 1207 174 134 0.87400436 0.90007454 0.8868479
B-PER 1795 94 47 0.9502382 0.97448426 0.96220857
tp: 7906 fp: 670 fn: 697 labels: 8
Macro-average prec: 0.90524435, rec: 0.87875825, f1: 0.8918047
Micro-average prec: 0.921875, rec: 0.91898173, f1: 0.9204261
Epoch 4/8 started, lr: 0.0019704434, dataset size: 11079
Epoch 2/8 - 252.19s - loss: 828.8285 - batches: 695
Quality on test dataset:
time to finish evaluation: 17.37s
label tp fp fn prec rec f1
B-LOC 1722 67 115 0.9625489 0.93739796 0.9498069
I-ORG 624 134 127 0.823219 0.83089215 0.8270378
I-MISC 230 30 116 0.88461536 0.6647399 0.75907594
I-LOC 199 13 58 0.9386792 0.77431905 0.8486141
I-PER 1274 44 33 0.9666161 0.97475135 0.9706667
B-MISC 787 70 135 0.9183197 0.85357916 0.8847667
B-ORG 1212 204 129 0.8559322 0.9038031 0.87921643
B-PER 1807 109 35 0.94311064 0.98099893 0.9616817
tp: 7855 fp: 671 fn: 748 labels: 8
Macro-average prec: 0.9116301, rec: 0.8650602, f1: 0.88773483
Micro-average prec: 0.9212996, rec: 0.9130536, f1: 0.917158
Epoch 3/8 started, lr: 0.001980198, dataset size: 11079
Epoch 4/8 - 250.45s - loss: 512.68085 - batches: 695
Quality on test dataset:
time to finish evaluation: 17.78s
label tp fp fn prec rec f1
B-LOC 1767 78 70 0.95772356 0.9618944 0.9598045
I-ORG 658 75 93 0.89768076 0.8761651 0.8867924
I-MISC 257 38 89 0.87118644 0.74277455 0.801872
I-LOC 229 18 28 0.9271255 0.8910506 0.9087302
I-PER 1264 21 43 0.9836576 0.9671002 0.97530866
B-MISC 841 127 81 0.86880165 0.9121475 0.8899471
B-ORG 1202 114 139 0.9133739 0.89634603 0.90477985
B-PER 1799 87 43 0.95387065 0.9766558 0.9651288
tp: 8017 fp: 558 fn: 586 labels: 8
Macro-average prec: 0.92167753, rec: 0.9030168, f1: 0.9122517
Micro-average prec: 0.9349271, rec: 0.9318842, f1: 0.93340325
Epoch 5/8 started, lr: 0.0019607844, dataset size: 11079
Epoch 3/8 - 252.61s - loss: 604.5874 - batches: 695
Quality on test dataset:
time to finish evaluation: 18.28s
label tp fp fn prec rec f1
B-LOC 1764 112 73 0.9402985 0.9602613 0.95017505
I-ORG 614 84 137 0.87965614 0.8175766 0.847481
I-MISC 244 34 102 0.8776978 0.70520234 0.78205127
I-LOC 220 29 37 0.88353413 0.8560311 0.8695652
I-PER 1268 38 39 0.9709035 0.97016066 0.97053194
B-MISC 799 96 123 0.89273745 0.8665944 0.87947166
B-ORG 1205 123 136 0.9073795 0.8985832 0.90295994
B-PER 1792 110 50 0.94216615 0.97285557 0.95726496
tp: 7906 fp: 626 fn: 697 labels: 8
Macro-average prec: 0.9117967, rec: 0.88090813, f1: 0.89608634
Micro-average prec: 0.9266292, rec: 0.91898173, f1: 0.92278963
Epoch 4/8 started, lr: 0.0019704434, dataset size: 11079
Epoch 5/8 - 257.56s - loss: 437.73123 - batches: 695
Quality on test dataset:
time to finish evaluation: 17.89s
label tp fp fn prec rec f1
B-LOC 1806 163 31 0.91721684 0.9831247 0.94902784
I-ORG 606 26 145 0.95886075 0.8069241 0.8763557
I-MISC 287 99 59 0.7435233 0.82947975 0.78415304
I-LOC 233 54 24 0.8118467 0.9066148 0.85661757
I-PER 1273 26 34 0.9799846 0.9739862 0.9769762
B-MISC 846 146 76 0.8528226 0.9175705 0.8840125
B-ORG 1149 37 192 0.9688027 0.85682327 0.9093787
B-PER 1797 77 45 0.9589114 0.97557 0.96716905
tp: 7997 fp: 628 fn: 606 labels: 8
Macro-average prec: 0.8989962, rec: 0.90626174, f1: 0.9026143
Micro-average prec: 0.9271884, rec: 0.92955947, f1: 0.9283724
Epoch 6/8 started, lr: 0.0019512196, dataset size: 11079
Epoch 4/8 - 255.39s - loss: 508.80334 - batches: 695
Quality on test dataset:
time to finish evaluation: 17.63s
label tp fp fn prec rec f1
B-LOC 1799 270 38 0.8695022 0.9793141 0.921147
I-ORG 616 97 135 0.86395514 0.82023966 0.8415301
I-MISC 253 33 93 0.88461536 0.73121387 0.8006329
I-LOC 236 117 21 0.66855526 0.91828793 0.77377045
I-PER 1256 18 51 0.98587126 0.96097934 0.9732662
B-MISC 799 66 123 0.92369944 0.8665944 0.89423615
B-ORG 1162 106 179 0.9164038 0.86651754 0.89076275
B-PER 1754 52 88 0.9712071 0.95222586 0.96162283
tp: 7875 fp: 759 fn: 728 labels: 8
Macro-average prec: 0.8854762, rec: 0.8869216, f1: 0.8861983
Micro-average prec: 0.91209173, rec: 0.91537833, f1: 0.9137321
Epoch 5/8 started, lr: 0.0019607844, dataset size: 11079
Epoch 6/8 - 262.11s - loss: 382.8735 - batches: 695
Quality on test dataset:
time to finish evaluation: 17.92s
label tp fp fn prec rec f1
B-LOC 1749 61 88 0.96629834 0.9520958 0.95914453
I-ORG 682 136 69 0.83374083 0.9081225 0.8693435
I-MISC 268 40 78 0.8701299 0.7745665 0.81957185
I-LOC 215 14 42 0.93886465 0.83657587 0.8847737
I-PER 1280 39 27 0.97043216 0.979342 0.97486675
B-MISC 837 96 85 0.8971061 0.90780914 0.90242594
B-ORG 1232 120 109 0.9112426 0.9187174 0.91496474
B-PER 1795 93 47 0.9507415 0.97448426 0.96246654
tp: 8058 fp: 599 fn: 545 labels: 8
Macro-average prec: 0.91731954, rec: 0.9064642, f1: 0.9118596
Micro-average prec: 0.9308074, rec: 0.93665, f1: 0.9337195
Epoch 7/8 started, lr: 0.0019417477, dataset size: 11079
Epoch 5/8 - 263.75s - loss: 450.50388 - batches: 695
Quality on test dataset:
time to finish evaluation: 17.58s
label tp fp fn prec rec f1
B-LOC 1749 64 88 0.9646994 0.9520958 0.95835614
I-ORG 689 180 62 0.79286534 0.9174434 0.85061723
I-MISC 275 83 71 0.7681564 0.79479766 0.78124994
I-LOC 210 16 47 0.9292035 0.8171206 0.8695652
I-PER 1271 33 36 0.97469324 0.972456 0.9735733
B-MISC 825 103 97 0.88900864 0.8947939 0.89189196
B-ORG 1239 127 102 0.90702784 0.9239374 0.9154045
B-PER 1791 71 51 0.96186894 0.9723127 0.96706253
tp: 8049 fp: 677 fn: 554 labels: 8
Macro-average prec: 0.89844036, rec: 0.90561974, f1: 0.90201575
Micro-average prec: 0.9224158, rec: 0.93560386, f1: 0.92896307
Epoch 6/8 started, lr: 0.0019512196, dataset size: 11079
Epoch 7/8 - 262.34s - loss: 330.9146 - batches: 695
Quality on test dataset:
time to finish evaluation: 18.09s
label tp fp fn prec rec f1
B-LOC 1760 74 77 0.95965105 0.9580838 0.9588668
I-ORG 630 36 121 0.9459459 0.8388815 0.88920254
I-MISC 283 93 63 0.75265956 0.8179191 0.7839335
I-LOC 225 20 32 0.9183673 0.8754864 0.8964143
I-PER 1273 32 34 0.97547895 0.9739862 0.974732
B-MISC 837 113 85 0.8810526 0.90780914 0.8942308
B-ORG 1230 96 111 0.9276018 0.91722596 0.92238474
B-PER 1801 70 41 0.9625869 0.9777416 0.97010505
tp: 8039 fp: 534 fn: 564 labels: 8
Macro-average prec: 0.915418, rec: 0.9083917, f1: 0.91189134
Micro-average prec: 0.9377114, rec: 0.93444145, f1: 0.93607366
Epoch 8/8 started, lr: 0.0019323673, dataset size: 11079
Epoch 6/8 - 264.34s - loss: 384.8886 - batches: 695
Quality on test dataset:
time to finish evaluation: 18.07s
label tp fp fn prec rec f1
B-LOC 1772 84 65 0.95474136 0.96461624 0.9596534
I-ORG 635 50 116 0.9270073 0.8455393 0.88440114
I-MISC 274 78 72 0.77840906 0.7919075 0.7851002
I-LOC 228 19 29 0.9230769 0.8871595 0.9047619
I-PER 1273 32 34 0.97547895 0.9739862 0.974732
B-MISC 842 125 80 0.8707342 0.9132321 0.8914769
B-ORG 1218 86 123 0.93404907 0.9082774 0.920983
B-PER 1791 65 51 0.96497846 0.9723127 0.9686317
tp: 8033 fp: 539 fn: 570 labels: 8
Macro-average prec: 0.9160594, rec: 0.90712893, f1: 0.9115723
Micro-average prec: 0.93712085, rec: 0.933744, f1: 0.9354294
Epoch 7/8 started, lr: 0.0019417477, dataset size: 11079
Epoch 8/8 - 266.21s - loss: 301.41052 - batches: 695
Quality on test dataset:
time to finish evaluation: 18.45s
label tp fp fn prec rec f1
B-LOC 1768 68 69 0.962963 0.96243876 0.96270084
I-ORG 658 49 93 0.9306931 0.8761651 0.9026063
I-MISC 267 56 79 0.8266254 0.7716763 0.7982063
I-LOC 228 14 29 0.94214875 0.8871595 0.91382766
I-PER 1272 35 35 0.9732211 0.9732211 0.9732211
B-MISC 834 98 88 0.8948498 0.9045553 0.8996764
B-ORG 1239 97 102 0.9273952 0.9239374 0.925663
B-PER 1806 94 36 0.9505263 0.98045605 0.9652592
tp: 8072 fp: 511 fn: 531 labels: 8
Macro-average prec: 0.9260528, rec: 0.90995115, f1: 0.9179313
Micro-average prec: 0.9404637, rec: 0.93827736, f1: 0.9393693
Epoch 7/8 - 256.62s - loss: 335.06775 - batches: 695
Quality on test dataset:
time to finish evaluation: 8.79s
label tp fp fn prec rec f1
B-LOC 1791 128 46 0.9332986 0.9749592 0.95367414
I-ORG 639 78 112 0.8912134 0.8508655 0.8705722
I-MISC 262 48 84 0.8451613 0.75722545 0.7987805
I-LOC 238 60 19 0.7986577 0.92607003 0.8576577
I-PER 1260 19 47 0.9851446 0.9640398 0.97447795
B-MISC 811 72 111 0.9184598 0.8796095 0.89861494
B-ORG 1215 95 126 0.92748094 0.90604025 0.9166353
B-PER 1786 56 56 0.96959823 0.96959823 0.96959823
tp: 8002 fp: 556 fn: 601 labels: 8
Macro-average prec: 0.90862685, rec: 0.903551, f1: 0.9060818
Micro-average prec: 0.93503153, rec: 0.9301407, f1: 0.9325797
Epoch 8/8 started, lr: 0.0019323673, dataset size: 11079
Epoch 8/8 - 133.22s - loss: 299.64578 - batches: 695
Quality on test dataset:
time to finish evaluation: 8.91s
label tp fp fn prec rec f1
B-LOC 1746 56 91 0.9689234 0.9504627 0.95960426
I-ORG 673 77 78 0.8973333 0.8961385 0.8967355
I-MISC 270 43 76 0.8626198 0.7803468 0.8194234
I-LOC 223 10 34 0.95708156 0.8677043 0.9102041
I-PER 1272 41 35 0.9687738 0.9732211 0.9709923
B-MISC 832 109 90 0.88416576 0.9023861 0.893183
B-ORG 1264 143 77 0.8983653 0.94258016 0.9199418
B-PER 1801 76 41 0.95950985 0.9777416 0.9685399
tp: 8081 fp: 555 fn: 522 labels: 8
Macro-average prec: 0.9245966, rec: 0.9113227, f1: 0.91791165
Micro-average prec: 0.93573415, rec: 0.9393235, f1: 0.9375254
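The log above interleaves losses and evaluation blocks, which makes the trend hard to see at a glance. A small regex pass can pull the micro-average F1 out of each evaluation block; the excerpt below is hard-coded from the log for illustration, but the same pattern works on the full log file:

```python
import re

# Extract the micro-average F1 from each "Quality on test dataset" block
# of a NerDLApproach log. `log_text` is a short excerpt; for the real
# run, read the file under ~/annotator_logs/ instead.
log_text = """
Epoch 1/8 - 159.93s - loss: 2234.436 - batches: 695
Micro-average prec: 0.903579, rec: 0.8921307, f1: 0.8978184
Epoch 2/8 - 246.28s - loss: 839.1736 - batches: 695
Micro-average prec: 0.9143157, rec: 0.9116587, f1: 0.9129852
"""

f1s = [float(m) for m in re.findall(r"Micro-average.*f1: ([0-9.]+)", log_text)]
print(f1s)  # [0.8978184, 0.9129852]
```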
In [22]:
import pyspark.sql.functions as F
predictions = ner_model.transform(test_data)
predictions.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
F.expr("cols['1']").alias("ground_truth"),
F.expr("cols['2']").alias("prediction")).show(truncate=False)
+--------------+------------+----------+
|token |ground_truth|prediction|
+--------------+------------+----------+
|CRICKET |O |O |
|- |O |O |
|LEICESTERSHIRE|B-ORG |B-ORG |
|TAKE |O |O |
|OVER |O |O |
|AT |O |O |
|TOP |O |O |
|AFTER |O |O |
|INNINGS |O |O |
|VICTORY |O |O |
|. |O |O |
|LONDON |B-LOC |B-LOC |
|1996-08-30 |O |O |
|West |B-MISC |B-MISC |
|Indian |I-MISC |I-MISC |
|all-rounder |O |O |
|Phil |B-PER |B-PER |
|Simmons |I-PER |I-PER |
|took |O |O |
|four |O |O |
+--------------+------------+----------+
only showing top 20 rows
In [23]:
from sklearn.metrics import classification_report
preds_df = predictions.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
F.expr("cols['1']").alias("ground_truth"),
F.expr("cols['2']").alias("prediction")).toPandas()
print(classification_report(preds_df['ground_truth'], preds_df['prediction']))
precision recall f1-score support
B-LOC 0.97 0.95 0.96 1837
B-MISC 0.88 0.90 0.89 922
B-ORG 0.90 0.94 0.92 1341
B-PER 0.96 0.98 0.97 1842
I-LOC 0.96 0.87 0.91 257
I-MISC 0.86 0.78 0.82 346
I-ORG 0.90 0.90 0.90 751
I-PER 0.97 0.97 0.97 1307
O 1.00 1.00 1.00 42759
accuracy 0.99 51362
macro avg 0.93 0.92 0.93 51362
weighted avg 0.99 0.99 0.99 51362
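The report above scores individual tokens, which is more lenient than the CoNLL convention of scoring whole entity spans (a chunk counts as correct only if both its boundaries and its type match). A minimal BIO span extractor for an entity-level comparison; this is an illustrative sketch, not the official conlleval script, and it ignores type changes inside a span:

```python
# Convert a BIO tag sequence into (start, end, type) spans, then compare
# gold and predicted spans set-wise for entity-level precision/recall.
def bio_to_spans(tags):
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if tag.startswith("B-") or (start is not None and not tag.startswith("I-")):
            if start is not None:
                spans.append((start, i, etype))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

truth = ["B-PER", "I-PER", "O", "B-LOC"]
pred  = ["B-PER", "I-PER", "O", "B-ORG"]
t, p = set(bio_to_spans(truth)), set(bio_to_spans(pred))
tp = len(t & p)
prec, rec = tp / len(p), tp / len(t)
print(tp, round(prec, 2), round(rec, 2))  # 1 0.5 0.5
```

For the full entity-level metrics used in the CoNLL shared tasks, libraries such as `seqeval` implement this bookkeeping (including malformed-tag handling) more robustly.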
Saving the Trained Model
In [24]:
ner_model.stages
Out[24]:
[WORD_EMBEDDINGS_MODEL_48cffc8b9a76, NerDLModel_6a88a8ead3fd]
In [25]:
ner_model.stages[1].write().overwrite().save("/content/drive/MyDrive/SparkNLPTask/Ner_glove_100d_e8_b16_lr0.02")
Prediction Pipeline
In [28]:
document = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence = SentenceDetector()\
.setInputCols(['document'])\
.setOutputCol('sentence')
token = Tokenizer()\
.setInputCols(['sentence'])\
.setOutputCol('token')
glove_embeddings = WordEmbeddingsModel.pretrained()\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
loaded_ner_model = NerDLModel.load("/content/drive/MyDrive/SparkNLPTask/Ner_glove_100d_e8_b16_lr0.02")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_span")
ner_prediction_pipeline = Pipeline(stages = [
document,
sentence,
token,
glove_embeddings,
loaded_ner_model,
converter
])
empty_data = spark.createDataFrame([['']]).toDF("text")
prediction_model = ner_prediction_pipeline.fit(empty_data)
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
In [33]:
text = '''
The final has its own Merseyside subplot, as it will pit Liverpool forwards Mo Salah (of Egypt: pictured above, in white, in the semi-final) and Sadio Mané (of Senegal) against each other. They are just two of the African stars to play for European clubs—the world’s strongest. In fact, only four teams in the English Premier League don’t have a player from the continent. Besides Mr Salah and Mr Mané, Riyad Mahrez of Algeria is at Manchester City, Wilfred Ndidi of Nigeria and Chelsea boasts Edouard Mendy, Senegal’s goalkeeper, and Hakim Ziyech of Morocco. In Italy’s Serie A, Kalidou Koulibaly of Senegal plays for Napoli and Franck Kessie of the Ivory Coast turns out for AC Milan. Eric Maxim Choupo-Moting of Cameroon and Bouna Sarr of Senegal both play for Bayern Munich, the dominant club in Germany’s Bundesliga.
'''
sample_data = spark.createDataFrame([[text]]).toDF("text")
sample_data.show(truncate=False)
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
The final has its own Merseyside subplot, as it will pit Liverpool forwards Mo Salah (of Egypt: pictured above, in white, in the semi-final) and Sadio Mané (of Senegal) against each other. They are just two of the African stars to play for European clubs—the world’s strongest. In fact, only four teams in the English Premier League don’t have a player from the continent. Besides Mr Salah and Mr Mané, Riyad Mahrez of Algeria is at Manchester City, Wilfred Ndidi of Nigeria and Chelsea boasts Edouard Mendy, Senegal’s goalkeeper, and Hakim Ziyech of Morocco. In Italy’s Serie A, Kalidou Koulibaly of Senegal plays for Napoli and Franck Kessie of the Ivory Coast turns out for AC Milan. Eric Maxim Choupo-Moting of Cameroon and Bouna Sarr of Senegal both play for Bayern Munich, the dominant club in Germany’s Bundesliga.
|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
In [34]:
preds = prediction_model.transform(sample_data)
result_df = preds.select(F.explode(F.arrays_zip("ner_span.result","ner_span.metadata")).alias("entities")) \
.select(F.expr("entities['0']").alias("chunk"),
F.expr("entities['1'].entity").alias("entity")).show(truncate=False)
+---------------+------+
|chunk |entity|
+---------------+------+
|Merseyside |ORG |
|Liverpool |ORG |
|Mo Salah |PER |
|Egypt |LOC |
|Sadio Mané |PER |
|Senegal |LOC |
|African |MISC |
|European |MISC |
|English |MISC |
|Premier League |ORG |
|Mr Salah |PER |
|Mr Mané |PER |
|Riyad Mahrez |PER |
|Algeria |LOC |
|Manchester City|LOC |
|Wilfred Ndidi |PER |
|Nigeria |LOC |
|Chelsea |ORG |
|Edouard Mendy |PER |
|Senegal’s |PER |
+---------------+------+
only showing top 20 rows
In [35]:
from sparknlp.base import LightPipeline
light_model = LightPipeline(prediction_model)
result = light_model.annotate(text)
list(zip(result['token'], result['ner']))
Out[35]:
[('The', 'O'),
('final', 'O'),
('has', 'O'),
('its', 'O'),
('own', 'O'),
('Merseyside', 'B-ORG'),
('subplot', 'O'),
(',', 'O'),
('as', 'O'),
('it', 'O'),
('will', 'O'),
('pit', 'O'),
('Liverpool', 'B-ORG'),
('forwards', 'O'),
('Mo', 'B-PER'),
('Salah', 'I-PER'),
('(', 'O'),
('of', 'O'),
('Egypt', 'B-LOC'),
(':', 'O'),
('pictured', 'O'),
('above', 'O'),
(',', 'O'),
('in', 'O'),
('white', 'O'),
(',', 'O'),
('in', 'O'),
('the', 'O'),
('semi-final', 'O'),
(')', 'O'),
('and', 'O'),
('Sadio', 'B-PER'),
('Mané', 'I-PER'),
('(', 'O'),
('of', 'O'),
('Senegal', 'B-LOC'),
(')', 'O'),
('against', 'O'),
('each', 'O'),
('other', 'O'),
('.', 'O'),
('They', 'O'),
('are', 'O'),
('just', 'O'),
('two', 'O'),
('of', 'O'),
('the', 'O'),
('African', 'B-MISC'),
('stars', 'O'),
('to', 'O'),
('play', 'O'),
('for', 'O'),
('European', 'B-MISC'),
('clubs—the', 'O'),
('world’s', 'O'),
('strongest', 'O'),
('.', 'O'),
('In', 'O'),
('fact', 'O'),
(',', 'O'),
('only', 'O'),
('four', 'O'),
('teams', 'O'),
('in', 'O'),
('the', 'O'),
('English', 'B-MISC'),
('Premier', 'B-ORG'),
('League', 'I-ORG'),
('don’t', 'O'),
('have', 'O'),
('a', 'O'),
('player', 'O'),
('from', 'O'),
('the', 'O'),
('continent', 'O'),
('.', 'O'),
('Besides', 'O'),
('Mr', 'B-PER'),
('Salah', 'I-PER'),
('and', 'O'),
('Mr', 'B-PER'),
('Mané', 'I-PER'),
(',', 'O'),
('Riyad', 'B-PER'),
('Mahrez', 'I-PER'),
('of', 'O'),
('Algeria', 'B-LOC'),
('is', 'O'),
('at', 'O'),
('Manchester', 'B-LOC'),
('City', 'I-LOC'),
(',', 'O'),
('Wilfred', 'B-PER'),
('Ndidi', 'I-PER'),
('of', 'O'),
('Nigeria', 'B-LOC'),
('and', 'O'),
('Chelsea', 'B-ORG'),
('boasts', 'O'),
('Edouard', 'B-PER'),
('Mendy', 'I-PER'),
(',', 'O'),
('Senegal’s', 'B-PER'),
('goalkeeper', 'O'),
(',', 'O'),
('and', 'O'),
('Hakim', 'B-PER'),
('Ziyech', 'I-PER'),
('of', 'O'),
('Morocco', 'B-LOC'),
('.', 'O'),
('In', 'O'),
('Italy’s', 'B-MISC'),
('Serie', 'I-MISC'),
('A', 'I-MISC'),
(',', 'O'),
('Kalidou', 'B-PER'),
('Koulibaly', 'I-PER'),
('of', 'O'),
('Senegal', 'B-LOC'),
('plays', 'O'),
('for', 'O'),
('Napoli', 'B-ORG'),
('and', 'O'),
('Franck', 'B-PER'),
('Kessie', 'I-PER'),
('of', 'O'),
('the', 'O'),
('Ivory', 'B-LOC'),
('Coast', 'I-LOC'),
('turns', 'O'),
('out', 'O'),
('for', 'O'),
('AC', 'B-ORG'),
('Milan', 'I-ORG'),
('.', 'O'),
('Eric', 'B-PER'),
('Maxim', 'I-PER'),
('Choupo-Moting', 'I-PER'),
('of', 'O'),
('Cameroon', 'B-LOC'),
('and', 'O'),
('Bouna', 'B-PER'),
('Sarr', 'I-PER'),
('of', 'O'),
('Senegal', 'B-LOC'),
('both', 'O'),
('play', 'O'),
('for', 'O'),
('Bayern', 'B-ORG'),
('Munich', 'I-ORG'),
(',', 'O'),
('the', 'O'),
('dominant', 'O'),
('club', 'O'),
('in', 'O'),
('Germany’s', 'B-MISC'),
('Bundesliga', 'I-MISC'),
('.', 'O')]
In [37]:
import pandas as pd
result = light_model.fullAnnotate(text)
ner_df = pd.DataFrame([(int(x.metadata['sentence']), x.result, x.begin, x.end, y.result) for x, y in zip(result[0]["token"], result[0]["ner"])],
columns=['sent_id', 'token', 'start', 'end', 'ner'])
ner_df.head(15)
Out[37]:
sent_id token start end ner
0 0 The 1 3 O
1 0 final 5 9 O
2 0 has 11 13 O
3 0 its 15 17 O
4 0 own 19 21 O
5 0 Merseyside 23 32 B-ORG
6 0 subplot 34 40 O
7 0 , 41 41 O
8 0 as 43 44 O
9 0 it 46 47 O
10 0 will 49 52 O
11 0 pit 54 56 O
12 0 Liverpool 58 66 B-ORG
13 0 forwards 68 75 O
14 0 Mo 77 78 B-PER
Highlight Entities
In [38]:
ann_text = light_model.fullAnnotate(text)[0]
ann_text.keys()
Out[38]:
dict_keys(['document', 'ner_span', 'token', 'ner', 'embeddings', 'sentence'])
In [39]:
from sparknlp_display import NerVisualizer
visualiser = NerVisualizer()
print('Standard Output')
visualiser.display(ann_text, label_col='ner_span', document_col='document')
Standard Output
The final has its own Merseyside ORG subplot, as it will pit Liverpool ORG forwards Mo Salah PER (of Egypt LOC: pictured above, in white, in the semi-final) and Sadio Mané PER (of Senegal LOC) against each other. They are just two of the African MISC stars to play for European MISC clubs—the world’s strongest. In fact, only four teams in the English MISC Premier League ORG don’t have a player from the continent. Besides Mr Salah PER and Mr Mané PER, Riyad Mahrez PER of Algeria LOC is at Manchester City LOC, Wilfred Ndidi PER of Nigeria LOC and Chelsea ORG boasts Edouard Mendy PER, Senegal’s PER goalkeeper, and Hakim Ziyech PER of Morocco LOC. In Italy’s Serie A MISC, Kalidou Koulibaly PER of Senegal LOC plays for Napoli ORG and Franck Kessie PER of the Ivory Coast LOC turns out for AC Milan ORG. Eric Maxim Choupo-Moting PER of Cameroon LOC and Bouna Sarr PER of Senegal LOC both play for Bayern Munich ORG, the dominant club in Germany’s Bundesliga MISC.
Streamlit
In [14]:
! pip install -q pyspark==3.1.2 spark-nlp
! pip install -q spark-nlp-display
In [ ]:
!pip install streamlit
!pip install pyngrok==4.1.1
In [2]:
! wget https://raw.githubusercontent.com/gokhanturer/JSL/main/streamlit_me_ner_model.py
--2022-02-06 22:39:33-- https://raw.githubusercontent.com/gokhanturer/JSL/main/streamlit_me_ner_model.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7979 (7.8K) [text/plain]
Saving to: ‘streamlit_me_ner_model.py.3’
streamlit_me_ner_mo 100%[===================>] 7.79K --.-KB/s in 0s
2022-02-06 22:39:34 (93.4 MB/s) - ‘streamlit_me_ner_model.py.3’ saved [7979/7979]
In [3]:
!ngrok authtoken 24jtZ2Watn1mc1bSG6v19fel7p1_2bYeRjRkniKqqhfgRs6ub
Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml
In [5]:
!streamlit run streamlit_me_ner_model.py &>/dev/null&
In [6]:
from pyngrok import ngrok
public_url = ngrok.connect(port='8501')
public_url
Out[6]:
'http://2d54-34-125-109-11.ngrok.io'
In [7]:
!killall ngrok
public_url = ngrok.connect(port='8501')
public_url
Out[7]:
'http://df30-34-125-109-11.ngrok.io'
```
## Results
```bash
+---------------+------+
|chunk |entity|
+---------------+------+
|Merseyside |ORG |
|Liverpool |ORG |
|Mo Salah |PER |
|Egypt |LOC |
|Sadio Mané |PER |
|Senegal |LOC |
|African |MISC |
|European |MISC |
|English |MISC |
|Premier League |ORG |
|Mr Salah |PER |
|Mr Mané |PER |
|Riyad Mahrez |PER |
|Algeria |LOC |
|Manchester City|LOC |
|Wilfred Ndidi |PER |
|Nigeria |LOC |
|Chelsea |ORG |
|Edouard Mendy |PER |
|Senegal’s |PER |
+---------------+------+
```
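The `chunk`/`entity` pairs above come from merging token-level IOB tags (the `B-`/`I-` labels listed in the Benchmarking section) into entity chunks, which is what a NER-converter stage does. A minimal, stdlib-only sketch of that merge (the token/tag example below is illustrative, not actual model output):

```python
def bio_to_chunks(tokens, tags):
    """Merge token-level BIO tags (B-PER, I-PER, O, ...) into (chunk, entity) pairs."""
    chunks, cur, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:
                chunks.append((" ".join(cur), label))
            cur, label = [tok], tag[2:]
        elif tag.startswith("I-") and cur and tag[2:] == label:
            cur.append(tok)  # continuation of the current entity
        else:
            if cur:
                chunks.append((" ".join(cur), label))
            cur, label = [], None
    if cur:
        chunks.append((" ".join(cur), label))
    return chunks

print(bio_to_chunks(["Mo", "Salah", "plays", "for", "Liverpool"],
                    ["B-PER", "I-PER", "O", "O", "B-ORG"]))
# [('Mo Salah', 'PER'), ('Liverpool', 'ORG')]
```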
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|Ner_conll2003_100d|
|Type:|ner|
|Compatibility:|Spark NLP 3.1.2+|
|License:|Open Source|
|Edition:|Community|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|14.3 MB|
|Dependencies:|glove100d|
## References
This model was trained on data from:
https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.train https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.testa
## Benchmarking
```bash
label precision recall f1-score support
B-LOC 0.97 0.95 0.96 1837
B-MISC 0.88 0.90 0.89 922
B-ORG 0.90 0.94 0.92 1341
B-PER 0.96 0.98 0.97 1842
I-LOC 0.96 0.87 0.91 257
I-MISC 0.86 0.78 0.82 346
I-ORG 0.90 0.90 0.90 751
I-PER 0.97 0.97 0.97 1307
O 1.00 1.00 1.00 42759
accuracy - - 0.99 51362
macro-avg 0.93 0.92 0.93 51362
weighted-avg 0.99 0.99 0.99 51362
```
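As a quick sanity check on the table above, the `f1-score` column is the harmonic mean of precision and recall:

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# B-LOC and B-PER rows from the benchmarking table above
print(round(f1(0.97, 0.95), 2))  # 0.96
print(round(f1(0.96, 0.98), 2))  # 0.97
```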
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from akr)
author: John Snow Labs
name: distilbert_qa_akr_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `akr`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_akr_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769755011.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_akr_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769755011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_akr_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_akr_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
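Extractive QA models like this one score every token of the context as a possible answer start and end; the returned answer is the span that maximizes start score + end score with end ≥ start. A toy, stdlib-only sketch (the scores below are invented for illustration):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) token indices maximizing start + end score, with end >= start."""
    best, span = float("-inf"), (0, 0)
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best:
                best, span = s + end_scores[j], (i, j)
    return span

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end   = [0.1, 0.1, 0.2, 4.8, 0.2, 0.1, 0.1, 0.1, 0.4, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```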
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_akr_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/akr/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English asr_wav2vec2_large_960h TFWav2Vec2ForCTC from facebook
author: John Snow Labs
name: asr_wav2vec2_large_960h
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h` is an English model originally trained by facebook.
NOTE: This model works on CPUs only. If you need to run it on a GPU device, please use asr_wav2vec2_large_960h_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_en_4.2.0_3.0_1664016568413.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_en_4.2.0_3.0_1664016568413.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_960h", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_960h", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
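Wav2Vec2ForCTC emits one token distribution per audio frame; the simplest way to turn frame-level predictions into text is greedy CTC decoding, which collapses consecutive repeats and then drops the blank token. A toy sketch (the vocabulary here is made up for illustration):

```python
def ctc_greedy_decode(token_ids, blank_id=0, id_to_char=None):
    """Collapse repeated ids, then drop blanks -- greedy CTC decoding."""
    out, prev = [], None
    for t in token_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return "".join(id_to_char[i] for i in out) if id_to_char else out

vocab = {1: "h", 2: "e", 3: "l", 4: "o"}
# Frame-level argmax ids; 0 is the CTC blank separating the repeated "l"
print(ctc_greedy_decode([1, 1, 0, 2, 3, 3, 0, 3, 4, 0], id_to_char=vocab))  # hello
```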
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_960h|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|755.4 MB|
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_el12
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el12` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el12_en_4.3.0_3.0_1675119407178.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el12_en_4.3.0_3.0_1675119407178.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_small_el12","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_small_el12","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_el12|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|184.0 MB|
## References
- https://huggingface.co/google/t5-efficient-small-el12
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English BertForQuestionAnswering model (from Sotireas)
author: John Snow Labs
name: bert_qa_Sotireas_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext-ContaminationQAmodel_PubmedBERT` is an English model originally trained by `Sotireas`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Sotireas_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT_en_4.0.0_3.0_1654176486544.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Sotireas_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT_en_4.0.0_3.0_1654176486544.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Sotireas_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_Sotireas_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_Sotireas_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|408.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Sotireas/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext-ContaminationQAmodel_PubmedBERT
---
layout: model
title: Legal Preamble Clause Binary Classifier
author: John Snow Labs
name: legclf_preamble_clause
date: 2023-02-13
tags: [en, legal, classification, preamble, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `preamble` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only individual sentences rather than the whole text, so it is better to skip them, unless you want to do binary classification at the sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that this model's embeddings allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
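The first technique listed above, paragraph splitting by multiline, can be sketched with nothing but the standard library:

```python
import re

def split_paragraphs(text: str):
    """Split a document into paragraphs on blank lines (multiline splitting)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical snippet of a legal document, for illustration only
doc = "RECITALS\n\nWHEREAS, the parties wish to cooperate...\n\nNOW, THEREFORE, the parties agree..."
for paragraph in split_paragraphs(doc):
    print(paragraph)
```

Each resulting paragraph can then be fed to the classifier as a separate row.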
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you add.
## Predicted Entities
`preamble`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_preamble_clause_en_1.0.0_3.0_1676302301456.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_preamble_clause_en_1.0.0_3.0_1676302301456.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+----------+
|result    |
+----------+
|[preamble]|
|[other]   |
|[other]   |
|[preamble]|
+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_preamble_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 1.00 0.88 0.93 16
preamble 0.91 1.00 0.95 21
accuracy - - 0.95 37
macro-avg 0.96 0.94 0.94 37
weighted-avg 0.95 0.95 0.95 37
```
---
layout: model
title: Pipeline to Mapping SNOMED Codes with Their Corresponding ICDO Codes
author: John Snow Labs
name: snomed_icdo_mapping
date: 2022-06-27
tags: [snomed, icdo, pipeline, chunk_mapper, clinical, licensed, en]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of `snomed_icdo_mapper` model.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_icdo_mapping_en_3.5.3_3.0_1656364941154.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_icdo_mapping_en_3.5.3_3.0_1656364941154.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("snomed_icdo_mapping", "en", "clinical/models")
result= pipeline.fullAnnotate("10376009 2026006 26638004")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("snomed_icdo_mapping", "en", "clinical/models")
val result= pipeline.fullAnnotate("10376009 2026006 26638004")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.snomed_to_icdo.pipe").predict("""10376009 2026006 26638004""")
```
## Results
```bash
| | snomed_code | icdo_code |
|---:|:------------------------------|:-------------------------|
| 0 | 10376009 | 2026006 | 26638004 | 8050/2 | 9014/0 | 8322/0 |
```
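Conceptually, the ChunkMapperModel inside this pipeline is a code-to-code lookup. The three SNOMED → ICD-O pairs below are taken from the Results table above; the real model ships a much larger dictionary, so this sketch is illustrative only:

```python
# Toy code-to-code lookup; the three pairs come from the Results table above.
snomed_to_icdo = {
    "10376009": "8050/2",
    "2026006": "9014/0",
    "26638004": "8322/0",
}

def map_codes(text: str):
    """Map each whitespace-separated SNOMED code to its ICD-O code (NONE if unknown)."""
    return [snomed_to_icdo.get(tok, "NONE") for tok in text.split()]

print(map_codes("10376009 2026006 26638004"))  # ['8050/2', '9014/0', '8322/0']
```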
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|snomed_icdo_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|208.7 KB|
## Included Models
- DocumentAssembler
- TokenizerModel
- ChunkMapperModel
---
layout: model
title: Tagalog RoBERTa Embeddings (Base)
author: John Snow Labs
name: roberta_embeddings_roberta_tagalog_base
date: 2022-04-14
tags: [roberta, embeddings, tl, open_source]
task: Embeddings
language: tl
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-tagalog-base` is a Tagalog model originally trained by `jcblaise`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_tagalog_base_tl_3.4.2_3.0_1649948855487.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_tagalog_base_tl_3.4.2_3.0_1649948855487.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_tagalog_base","tl") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Gustung-gusto ko ang Spark NLP."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_tagalog_base","tl")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Gustung-gusto ko ang Spark NLP.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("tl.embed.roberta_tagalog_base").predict("""Gustung-gusto ko ang Spark NLP.""")
```
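Downstream, the token vectors in the `embeddings` column are typically compared with cosine similarity; a minimal stdlib sketch (the vectors are toy values, not real RoBERTa outputs):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

print(round(cosine([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]), 2))  # 1.0 (identical direction)
print(round(cosine([1.0, 0.0], [0.0, 1.0]), 2))            # 0.0 (orthogonal)
```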
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_roberta_tagalog_base|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|tl|
|Size:|410.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/jcblaise/roberta-tagalog-base
- https://blaisecruz.com
---
layout: model
title: Western Frisian BertForMaskedLM Base Cased model (from GroNLP)
author: John Snow Labs
name: bert_embeddings_base_dutch_cased_frisian
date: 2022-12-02
tags: [fy, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: fy
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-dutch-cased-frisian` is a Western Frisian model originally trained by `GroNLP`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_dutch_cased_frisian_fy_4.2.4_3.0_1670016581644.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_dutch_cased_frisian_fy_4.2.4_3.0_1670016581644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_dutch_cased_frisian","fy") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_dutch_cased_frisian","fy")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_dutch_cased_frisian|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|fy|
|Size:|351.7 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/GroNLP/bert-base-dutch-cased-frisian
- https://arxiv.org/abs/2105.02855
- https://github.com/wietsedv/low-resource-adapt
- https://github.com/wietsedv/bertje
---
layout: model
title: English BertForQuestionAnswering model (from ncduy)
author: John Snow Labs
name: bert_qa_MiniLM_L12_H384_uncased_finetuned_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `MiniLM-L12-H384-uncased-finetuned-squad` is an English model originally trained by `ncduy`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_MiniLM_L12_H384_uncased_finetuned_squad_en_4.0.0_3.0_1654178848040.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_MiniLM_L12_H384_uncased_finetuned_squad_en_4.0.0_3.0_1654178848040.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_MiniLM_L12_H384_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_MiniLM_L12_H384_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.mini_lm_base_uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_MiniLM_L12_H384_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|124.3 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/ncduy/MiniLM-L12-H384-uncased-finetuned-squad
---
layout: model
title: ALBERT Large CoNNL-03 NER Pipeline
author: John Snow Labs
name: albert_large_token_classifier_conll03_pipeline
date: 2022-06-19
tags: [open_source, ner, token_classifier, albert, conll03, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [albert_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/albert_large_token_classifier_conll03_en.html) model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655653727302.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655653727302.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("albert_large_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")
```
```scala
val pipeline = new PretrainedPipeline("albert_large_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|John |PER |
|John Snow Labs|ORG |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_large_token_classifier_conll03_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|64.4 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- AlbertForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: Sentence Entity Resolver for ICD-O (sbiobertresolve_icdo_augmented)
author: John Snow Labs
name: sbiobertresolve_icdo_augmented
date: 2022-06-06
tags: [licensed, clinical, en, icdo, entity_resolution]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.5.2
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted clinical entities to ICD-O codes using `sbiobert_base_cased_mli` Sentence BERT Embeddings. Given an oncological entity found in the text (via NER models like `ner_jsl`), it returns top terms and resolutions along with the corresponding ICD-O codes to present more granularity with respect to body parts mentioned. It also returns the original `Topography` and `Histology` codes, and their descriptions.
## Predicted Entities
`ICD-O Codes`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icdo_augmented_en_3.5.2_3.0_1654546345691.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icdo_augmented_en_3.5.2_3.0_1654546345691.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(["Oncological"])
c2doc = Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sentence_embeddings")
resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icdo_augmented", "en", "clinical/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
resolver_pipeline = Pipeline(
stages = [
document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
ner,
ner_converter,
c2doc,
sbert_embedder,
resolver
])
data = spark.createDataFrame([["""TRAF6 is a putative oncogene in a variety of cancers including urothelial cancer , and malignant melanoma. WWP2 appears to regulate the expression of the well characterized tumor and tensin homolog (PTEN) in endometroid adenocarcinoma and squamous cell carcinoma."""]]).toDF("text")
result = resolver_pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("Oncological"))
val c2doc = new Chunk2Doc()
.setInputCols("ner_chunk")
.setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
.setInputCols("ner_chunk_doc")
.setOutputCol("sentence_embeddings")
val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icdo_augmented", "en", "clinical/models")
.setInputCols(Array("sentence_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val resolver_pipeline = new Pipeline().setStages(Array(document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
ner,
ner_converter,
c2doc,
sbert_embedder,
resolver))
val data = Seq("""TRAF6 is a putative oncogene in a variety of cancers including urothelial cancer , and malignant melanoma. WWP2 appears to regulate the expression of the well characterized tumor and tensin homolog (PTEN) in endometroid adenocarcinoma and squamous cell carcinoma.""").toDS.toDF("text")
val results = resolver_pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.icdo_augmented").predict("""TRAF6 is a putative oncogene in a variety of cancers including urothelial cancer , and malignant melanoma. WWP2 appears to regulate the expression of the well characterized tumor and tensin homolog (PTEN) in endometroid adenocarcinoma and squamous cell carcinoma.""")
```
## Results
```bash
+--------------------------+-----------+---------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+
| chunk| entity|icdo_code| all_k_resolutions| all_k_codes|
+--------------------------+-----------+---------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+
| cancers|Oncological| 8000/3|cancer:::carcinoma:::carcinomatosis:::neoplasms:::ceruminous carcinoma::...|8000/3:::8010/3:::8010/9:::800:::8420/3:::8140/3:::8010/3-C76.0:::8010/6...|
| urothelial cancer|Oncological| 8120/3|urothelial carcinoma:::urothelial carcinoma in situ of urinary system:::...|8120/3:::8120/2-C68.9:::8010/3-C68.9:::8130/3-C68.9:::8070/3-C68.9:::813...|
| malignant melanoma|Oncological| 8720/3|malignant melanoma:::malignant melanoma, of skin:::malignant melanoma, o...|8720/3:::8720/3-C44.9:::8720/3-C06.9:::8720/3-C69.9:::8721/3:::8720/3-C0...|
| tumor|Oncological| 8000/1|tumor:::tumorlet:::tumor cells:::askin tumor:::tumor, secondary:::pilar ...|8000/1:::8040/1:::8001/1:::9365/3:::8000/6:::8103/0:::9364/3:::8940/0:::...|
|endometroid adenocarcinoma|Oncological| 8380/3|endometrioid adenocarcinoma:::endometrioid adenoma:::scirrhous adenocarc...|8380/3:::8380/0:::8141/3-C54.1:::8560/3-C54.1:::8260/3-C54.1:::8380/3-C5...|
| squamous cell carcinoma|Oncological| 8070/3|squamous cell carcinoma:::verrucous squamous cell carcinoma:::squamous c...|8070/3:::8051/3:::8070/2:::8052/3:::8070/3-C44.5:::8075/3:::8560/3:::807...|
+--------------------------+-----------+---------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+
```
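In the table above, `all_k_codes` and `all_k_resolutions` pack the k-best candidates into single `:::`-delimited strings. A small helper can pair them up for downstream use (an illustrative sketch, not part of the Spark NLP API; the `unpack_k_best` name is ours):

```python
def unpack_k_best(codes: str, resolutions: str, sep: str = ":::"):
    """Pair each candidate code with its resolution text.

    Both inputs are sep-delimited strings, as produced in the
    `all_k_codes` / `all_k_resolutions` columns shown above.
    """
    return list(zip(codes.split(sep), resolutions.split(sep)))

unpack_k_best("8000/3:::8010/3:::8010/9", "cancer:::carcinoma:::carcinomatosis")
# [('8000/3', 'cancer'), ('8010/3', 'carcinoma'), ('8010/9', 'carcinomatosis')]
```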
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_icdo_augmented|
|Compatibility:|Healthcare NLP 3.5.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[icdo_code]|
|Language:|en|
|Size:|175.7 MB|
|Case sensitive:|false|
## References
Trained on ICD-O Histology Behaviour dataset with sbiobert_base_cased_mli sentence embeddings. https://apps.who.int/iris/bitstream/handle/10665/96612/9789241548496_eng.pdf
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_dl12
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-dl12` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dl12_en_4.3.0_3.0_1675118545379.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dl12_en_4.3.0_3.0_1675118545379.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_small_dl12","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_small_dl12","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
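T5 checkpoints are multi-task: the task is selected by a short text prefix on the input (Spark NLP exposes this through the annotator's `setTask` parameter). The effect of such a prefix can be sketched in plain Python (the `with_task_prefix` helper is ours, for illustration only):

```python
def with_task_prefix(texts, task="summarize:"):
    # T5 routes an input to a task (summarization, translation,
    # question answering, ...) purely by its leading text prefix.
    return [f"{task} {t}" for t in texts]

with_task_prefix(["PUT YOUR STRING HERE"])
# ['summarize: PUT YOUR STRING HERE']
```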
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_dl12|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|196.2 MB|
## References
- https://huggingface.co/google/t5-efficient-small-dl12
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English image_classifier_vit_blocks ViTForImageClassification from lazyturtl
author: John Snow Labs
name: image_classifier_vit_blocks
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_blocks` is an English model originally trained by lazyturtl.
## Predicted Entities
`red color`, `orange color`, `green color`, `cyan color`, `yellow color`, `blue color`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_blocks_en_4.1.0_3.0_1660166657299.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_blocks_en_4.1.0_3.0_1660166657299.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_blocks", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
imageDF = spark.read.format("image").option("dropInvalid", True).load("path/to/images")  # placeholder path
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_blocks", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val imageDF = spark.read.format("image").option("dropInvalid", true).load("path/to/images") // placeholder path
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_blocks|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Fast Neural Machine Translation Model from Danish to English
author: John Snow Labs
name: opus_mt_da_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, da, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `da`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_da_en_xx_2.7.0_2.4_1609167272759.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_da_en_xx_2.7.0_2.4_1609167272759.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_da_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_da_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.da.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_da_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Indonesian RobertaForMaskedLM Small Cased model (from w11wo)
author: John Snow Labs
name: roberta_embeddings_indo_small
date: 2022-12-12
tags: [id, open_source, roberta_embeddings, robertaformaskedlm]
task: Embeddings
language: id
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indo-roberta-small` is an Indonesian model originally trained by `w11wo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indo_small_id_4.2.4_3.0_1670858716049.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indo_small_id_4.2.4_3.0_1670858716049.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_indo_small","id") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_indo_small","id")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
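The `embeddings` column holds one dense vector per token; downstream tasks typically compare such vectors with cosine similarity. A dependency-free sketch of that comparison (the `cosine_similarity` helper is ours, for illustration):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cosine_similarity([1.0, 0.0], [1.0, 1.0])  # ≈ 0.7071
```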
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_indo_small|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|id|
|Size:|313.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/w11wo/indo-roberta-small
- https://arxiv.org/abs/1907.11692
- https://github.com/sgugger
- https://w11wo.github.io/
---
layout: model
title: English image_classifier_vit_base_patch16_224_recylce_ft ViTForImageClassification from NhatPham
author: John Snow Labs
name: image_classifier_vit_base_patch16_224_recylce_ft
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224_recylce_ft` is an English model originally trained by NhatPham.
## Predicted Entities
`Non-Recycle`, `Object`, `Recycle`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_recylce_ft_en_4.1.0_3.0_1660167955114.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_recylce_ft_en_4.1.0_3.0_1660167955114.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_base_patch16_224_recylce_ft", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
imageDF = spark.read.format("image").option("dropInvalid", True).load("path/to/images")  # placeholder path
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_base_patch16_224_recylce_ft", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val imageDF = spark.read.format("image").option("dropInvalid", true).load("path/to/images") // placeholder path
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_base_patch16_224_recylce_ft|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Relation Extraction Model Clinical
author: John Snow Labs
name: re_drug_drug_interaction_clinical
class: RelationExtractionModel
language: en
nav_key: models
repository: clinical/models
date: 2020-09-03
task: Relation Extraction
edition: Healthcare NLP 2.5.5
spark_version: 2.4
tags: [clinical,licensed,relation extraction,en]
supported: true
annotator: RelationExtractionModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Relation Extraction model based on syntactic features using deep learning. This model can be used to identify drug-drug interaction relationships among drug entities.
## Predicted Entities
`DDI-advise`, `DDI-effect`, `DDI-mechanism`, `DDI-int`, `DDI-false`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_drug_drug_interaction_clinical_en_2.5.5_2.4_1599156924424.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_drug_drug_interaction_clinical_en_2.5.5_2.4_1599156924424.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
In the table below, `re_drug_drug_interaction_clinical` RE model, its labels, optimal NER model, and meaningful relation pairs are illustrated.
| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:---------------------------------:|-----------------------------------------------------------------------|:------------:|---------------|
| re_drug_drug_interaction_clinical | DDI-advise, DDI-effect, DDI-mechanism, DDI-int, DDI-false | ner_posology | ["drug-drug"] |
{% include programmingLanguageSelectScalaPython.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")
pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
ner_tagger = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")\
.setInputCols(["sentences", "tokens", "embeddings"])\
.setOutputCol("ner_tags")
ner_converter = NerConverter()\
.setInputCols(["sentences", "tokens", "ner_tags"])\
.setOutputCol("ner_chunks")
dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")\
.setInputCols(["sentences", "pos_tags", "tokens"])\
.setOutputCol("dependencies")
ddi_re_model = RelationExtractionModel.pretrained("re_drug_drug_interaction_clinical", "en", "clinical/models")\
.setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
.setRelationPairs(["drug-drug"])\
.setOutputCol("category")
nlp_pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_converter, dependency_parser, ddi_re_model])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
annotations = light_pipeline.fullAnnotate("""When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced. If additional adrenergic drugs are to be administered by any route, they should be used with caution because the pharmacologically predictable sympathetic effects of Metformin may be potentiated""")
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val ner_tagger = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_converter = new NerConverter()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
val ddi_re_model = RelationExtractionModel.pretrained("re_drug_drug_interaction_clinical", "en", "clinical/models")
.setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies"))
.setRelationPairs(Array("drug-drug"))
.setOutputCol("category")
val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_converter, dependency_parser, ddi_re_model))
val data = Seq("""When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced. If additional adrenergic drugs are to be administered by any route, they should be used with caution because the pharmacologically predictable sympathetic effects of Metformin may be potentiated""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
```bash
|relation   | entity1 | entity1_begin | entity1_end | chunk1        | entity2 | entity2_begin | entity2_end | chunk2       |
|DDI-advise | DRUG    | 5             | 17          | carbamazepine | DRUG    | 62            | 73          | aripiprazole |
```
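As the RE-pairs table above indicates, only drug-drug pairs are meaningful for this model. Filtering extracted relations down to allowed entity-type pairs can be sketched as follows (the dict layout and the `filter_relations` helper are illustrative assumptions, not the annotator's output schema):

```python
def filter_relations(relations, allowed_pairs):
    # Keep only relations whose (entity1, entity2) types form an
    # allowed pair, in either direction.
    allowed = {tuple(sorted(p)) for p in allowed_pairs}
    return [r for r in relations
            if tuple(sorted((r["entity1"], r["entity2"]))) in allowed]

rels = [{"relation": "DDI-advise", "entity1": "DRUG", "entity2": "DRUG"},
        {"relation": "DDI-false", "entity1": "DRUG", "entity2": "DOSAGE"}]
filter_relations(rels, [("DRUG", "DRUG")])
# [{'relation': 'DDI-advise', 'entity1': 'DRUG', 'entity2': 'DRUG'}]
```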
{:.model-param}
## Model Information
{:.table-model}
|----------------|-----------------------------------------|
| Name: | re_drug_drug_interaction_clinical |
| Type: | RelationExtractionModel |
| Compatibility: | Spark NLP 2.5.5+ |
| License: | Licensed |
| Edition: | Official |
|Input labels: | [word_embeddings, chunk, pos, dependency] |
|Output labels: | [category] |
| Language: | en |
| Case sensitive: | False |
| Dependencies: | embeddings_clinical |
{:.h2_title}
## Data Source
Trained on data gathered and manually annotated by John Snow Labs.
{:.h2_title}
## Benchmarking
```bash
+-------------+------+------+------+
| relation|recall| prec | f1 |
+-------------+------+------+------+
| DDI-effect| 0.76| 0.38 | 0.51 |
| DDI-false| 0.72| 0.97 | 0.83 |
| DDI-advise| 0.74| 0.39 | 0.51 |
+-------------+------+------+------+
```
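The `f1` column above is the harmonic mean of precision and recall, and the listed values can be reproduced directly:

```python
def f1_score(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

round(f1_score(0.38, 0.76), 2)  # 0.51  (DDI-effect row)
```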
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from comacrae)
author: John Snow Labs
name: roberta_qa_edav3
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-edav3` is an English model originally trained by `comacrae`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_edav3_en_4.3.0_3.0_1674220150986.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_edav3_en_4.3.0_3.0_1674220150986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_edav3","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_edav3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
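Extractive QA models of this kind score a start and an end position and return the corresponding span of the context. A simplified, character-level sketch of that idea (the real model operates on token indices; the `extract_span` name is ours):

```python
def extract_span(context: str, start: int, end: int) -> str:
    # The predicted answer is simply a slice of the context.
    return context[start:end]

context = "My name is Clara and I live in Berkeley."
extract_span(context, 11, 16)  # 'Clara'
```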
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_edav3|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/comacrae/roberta-edav3
---
layout: model
title: Multilingual T5ForConditionalGeneration Base Cased model (from Voicelab)
author: John Snow Labs
name: t5_vlt5_base_keywords
date: 2023-01-31
tags: [en, pl, open_source, t5, xx, tensorflow]
task: Text Generation
language: xx
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `vlt5-base-keywords` is a Multilingual model originally trained by `Voicelab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_vlt5_base_keywords_xx_4.3.0_3.0_1675158538277.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_vlt5_base_keywords_xx_4.3.0_3.0_1675158538277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_vlt5_base_keywords","xx") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_vlt5_base_keywords","xx")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
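`vlt5-base-keywords` generates its keywords as a short text string. Assuming a comma-separated output (an assumption about the generation format, not a guarantee from the model card), post-processing can be sketched as:

```python
def parse_keywords(generated: str):
    # Split the generated text on commas and drop empty entries.
    return [k.strip() for k in generated.split(",") if k.strip()]

parse_keywords("machine learning, keyword extraction , T5")
# ['machine learning', 'keyword extraction', 'T5']
```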
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_vlt5_base_keywords|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|xx|
|Size:|1.1 GB|
## References
- https://huggingface.co/Voicelab/vlt5-base-keywords
- https://nlp-demo-1.voicelab.ai/
- https://arxiv.org/abs/2209.14008
- https://voicelab.ai/contact/
---
layout: model
title: Athena Conditions Entity Resolver (Healthcare)
author: John Snow Labs
name: chunkresolve_athena_conditions_healthcare
class: ChunkEntityResolverModel
language: en
nav_key: models
repository: clinical/models
date: 2020-09-16
task: Entity Resolution
edition: Healthcare NLP 2.6.0
spark_version: 2.4
tags: [clinical,licensed,entity_resolution,en]
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Entity resolution model based on k-nearest neighbors over word embeddings, using Word Mover's Distance.
## Predicted Entities
Athena Codes and their normalized definition.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_athena_conditions_healthcare_en_2.6.0_2.4_1600265258887.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_athena_conditions_healthcare_en_2.6.0_2.4_1600265258887.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
This model requires `embeddings_healthcare_100d` and `ner_healthcare` in the pipeline you use.
{% include programmingLanguageSelectScalaPython.html %}
```python
...
athena_re_model = ChunkEntityResolverModel.pretrained("chunkresolve_athena_conditions_healthcare","en","clinical/models")\
.setInputCols("token","chunk_embeddings")\
.setOutputCol("entity")
pipeline_athena = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter, chunk_embeddings, athena_re_model])
data = spark.createDataFrame([["""The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion."""]]).toDF("text")
model = pipeline_athena.fit(data)
results = model.transform(data)
```
```scala
val athena_re_model = ChunkEntityResolverModel.pretrained("chunkresolve_athena_conditions_healthcare","en","clinical/models")
.setInputCols("token","chunk_embeddings")
.setOutputCol("entity")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter, chunk_embeddings, athena_re_model))
val data = Seq("The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
```bash
chunk entity athena_description athena_code
0 a cold PROBLEM Intolerant of cold 4213725
1 cough PROBLEM Cough 254761
2 runny nose PROBLEM O/E - nose 4156058
3 fever PROBLEM Fever 437663
4 difficulty breathing PROBLEM Difficulty breathing 4041664
5 her cough PROBLEM Does cough 4122567
6 dry PROBLEM Dry eyes 4036620
7 hacky PROBLEM Resolving infantile idiopathic scoliosis 44833868
8 physical exam TEST Physical angioedema 37110554
9 a right TM PROBLEM Tuberculosis of thyroid gland, unspecified 44819346
10 fairly congested PROBLEM Tonsil congested 4116401
11 Amoxil TREATMENT Amoxycillin overdose 4173544
12 Aldex TREATMENT Oral lesion 43530620
13 difficulty breathing PROBLEM Difficulty breathing 4041664
14 more congested PROBLEM Nasal congestion 4195085
15 a temperature TEST Tolerance of ambient temperature - finding 4271383
16 congestion PROBLEM Nasal congestion 4195085
```
{:.model-param}
## Model Information
{:.table-model}
|----------------|-------------------------------------------|
| Name: | chunkresolve_athena_conditions_healthcare |
| Type: | ChunkEntityResolverModel |
| Compatibility: | 2.6.0 |
| License: | Licensed |
|Edition:|Official|
|Input labels: | [token, chunk_embeddings] |
|Output labels: | [entity] |
| Language: | en |
| Case sensitive: | True |
| Dependencies: | embeddings_healthcare_100d |
{:.h2_title}
## Data Source
Trained on Athena dataset.
---
layout: model
title: German BertForMaskedLM Base Cased model (from deepset)
author: John Snow Labs
name: bert_embeddings_g_base
date: 2022-12-02
tags: [de, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: de
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gbert-base` is a German model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_g_base_de_4.2.4_3.0_1670022131926.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_g_base_de_4.2.4_3.0_1670022131926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_g_base","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_g_base","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_g_base|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|de|
|Size:|412.5 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/deepset/gbert-base
- https://arxiv.org/pdf/2010.10906.pdf
- https://deepset.ai/german-bert
- https://deepset.ai/germanquad
- https://github.com/deepset-ai/FARM
- https://github.com/deepset-ai/haystack/
- https://twitter.com/deepset_ai
- https://www.linkedin.com/company/deepset-ai/
- https://haystack.deepset.ai/community/join
- https://github.com/deepset-ai/haystack/discussions
- https://deepset.ai
- http://www.deepset.ai/jobs
---
layout: model
title: Part of Speech for Norwegian Nynorsk
author: John Snow Labs
name: pos_ud_nynorsk
date: 2021-03-09
tags: [part_of_speech, open_source, norwegian_nynorsk, pos_ud_nynorsk, nn]
task: Part of Speech Tagging
language: nn
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`.
## Predicted Entities
- DET
- NOUN
- ADP
- PUNCT
- CCONJ
- PRON
- VERB
- PROPN
- AUX
- ADJ
- ADV
- SCONJ
- PART
- INTJ
- NUM
- X
- SYM
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_nynorsk_nn_3.0.0_3.0_1615292123096.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_nynorsk_nn_3.0.0_3.0_1615292123096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ud_nynorsk", "nn") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['Hello from John Snow Labs!']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_nynorsk", "nn")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("Hello from John Snow Labs!").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs!"]
token_df = nlu.load('nn.pos.ud_nynorsk').predict(text)
token_df
```
## Results
```bash
token pos
0 Hello PROPN
1 from NOUN
2 John NOUN
3 Snow PROPN
4 Labs PROPN
5 ! PUNCT
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_nynorsk|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|nn|
---
layout: model
title: Recognize Entities OntoNotes pipeline - BERT Medium
author: John Snow Labs
name: onto_recognize_entities_bert_medium
date: 2021-03-23
tags: [open_source, english, onto_recognize_entities_bert_medium, pipeline, en]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: en
nav_key: models
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The onto_recognize_entities_bert_medium is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps and recognizes entities.
It performs most of the common text processing tasks on your DataFrame.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_medium_en_3.0.0_3.0_1616477173790.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_medium_en_3.0.0_3.0_1616477173790.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('onto_recognize_entities_bert_medium', lang = 'en')
annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("onto_recognize_entities_bert_medium", lang = "en")
val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs ! "]
result_df = nlu.load('en.ner.onto.bert.medium').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | embeddings | ner | entities |
|---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:-------------------------------------------|:-------------------|
| 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[0.0365490540862083,.,...]] | ['O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O'] | ['John Snow Labs'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|onto_recognize_entities_bert_medium|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
---
layout: model
title: Company Name to IRS (Edgar database)
author: John Snow Labs
name: finel_edgar_irs
date: 2022-08-30
tags: [en, finance, companies, edgar, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is an Entity Linking / Entity Resolution model, which allows you to retrieve the IRS number of a company given its name, using the SEC Edgar database.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/ER_EDGAR_CRUNCHBASE/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finel_edgar_irs_en_1.0.0_3.2_1661866402930.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finel_edgar_irs_en_1.0.0_3.2_1661866402930.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
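A minimal pipeline sketch is shown below. The sentence-embeddings stage (`sent_bert_base_uncased`) is an assumption, not taken from this card; pair the resolver with the same embeddings it was trained with.

```python
# Hypothetical sketch: the embeddings model name is an assumption.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Resolver consumes sentence embeddings (see Input Labels below)
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_uncased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

resolver = SentenceEntityResolverModel.pretrained("finel_edgar_irs", "en", "finance/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("irs_code")

pipeline = Pipeline(stages=[document_assembler, embeddings, resolver])
data = spark.createDataFrame([["CONTACT GOLD"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```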
## Results
```bash
+--------------+-----------+---------------------------------------------------------+--------------------------------------------------------+-------------------------------------------+
| chunk| code | all_codes| resolutions | all_distances|
+--------------+-----------+---------------------------------------------------------+--------------------------------------------------------+-------------------------------------------+
| CONTACT GOLD | 981369960| [981369960, 271989147, 208531222, 273566922, 270348508] |[981369960, 271989147, 208531222, 273566922, 270348508] | [0.1733, 0.3700, 0.3867, 0.4103, 0.4121] |
+--------------+-----------+---------------------------------------------------------+--------------------------------------------------------+-------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finel_edgar_irs|
|Type:|finance|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[company_irs_number]|
|Language:|en|
|Size:|313.8 MB|
|Case sensitive:|false|
## References
In-house scraping and postprocessing of the SEC Edgar database
---
layout: model
title: Word2Vec Embeddings in Palatine German (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, pfl, open_source]
task: Embeddings
language: pfl
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_pfl_3.4.1_3.0_1647451106726.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_pfl_3.4.1_3.0_1647451106726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pfl") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pfl")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("pfl.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|pfl|
|Size:|92.0 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Legal Method Of Exercise Clause Binary Classifier
author: John Snow Labs
name: legclf_method_of_exercise_clause
date: 2023-01-27
tags: [en, legal, classification, method, exercise, clauses, method_of_exercise, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `method-of-exercise` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`method-of-exercise`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_method_of_exercise_clause_en_1.0.0_3.0_1674821693129.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_method_of_exercise_clause_en_1.0.0_3.0_1674821693129.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
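A minimal pipeline sketch is shown below. The sentence-embeddings stage (`sent_bert_base_cased`) and the sample clause text are assumptions, not taken from this card; use the embeddings this classifier was trained with.

```python
# Hypothetical sketch: the embeddings model name and input text are assumptions.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Classifier consumes sentence embeddings (see Input Labels below)
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = ClassifierDLModel.pretrained("legclf_method_of_exercise_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])
data = spark.createDataFrame([["The Option may be exercised in whole or in part by written notice to the Company."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```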
## Results
```bash
+--------------------+
|result              |
+--------------------+
|[method-of-exercise]|
|[other]             |
|[other]             |
|[method-of-exercise]|
+--------------------+
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_method_of_exercise_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
method-of-exercise 0.97 1.00 0.98 32
other 1.00 0.97 0.99 38
accuracy - - 0.99 70
macro-avg 0.98 0.99 0.99 70
weighted-avg 0.99 0.99 0.99 70
```
---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_0
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_0_en_4.0.0_3.0_1655730608726.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_0_en_4.0.0_3.0_1655730608726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_1024d_seed_0").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|439.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-0
---
layout: model
title: Recognize Entities DL Pipeline for Finnish - Medium
author: John Snow Labs
name: entity_recognizer_md
date: 2021-03-22
tags: [open_source, finnish, entity_recognizer_md, pipeline, fi]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: fi
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The entity_recognizer_md is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps and recognizes entities.
It performs most of the common text processing tasks on your DataFrame.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_fi_3.0.0_3.0_1616456428015.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_fi_3.0.0_3.0_1616456428015.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('entity_recognizer_md', lang = 'fi')
annotations = pipeline.fullAnnotate("Hei John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("entity_recognizer_md", lang = "fi")
val result = pipeline.fullAnnotate("Hei John Snow Labs! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hei John Snow Labs! "]
result_df = nlu.load('fi.ner.md').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | embeddings | ner | entities |
|---:|:-------------------------|:------------------------|:---------------------------------|:-----------------------------|:---------------------------------|:--------------------|
| 0 | ['Hei John Snow Labs! '] | ['Hei John Snow Labs!'] | ['Hei', 'John', 'Snow', 'Labs!'] | [[0.1868100017309188,.,...]] | ['O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|entity_recognizer_md|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|fi|
---
layout: model
title: Drug Reviews Classifier (BioBERT)
author: John Snow Labs
name: bert_sequence_classifier_drug_reviews_webmd
date: 2022-07-28
tags: [en, clinical, licensed, public_health, classifier, sequence_classification, drug, review]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier that can classify drug reviews from WebMD.com.
## Predicted Entities
`negative`, `positive`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_CHANGE_DRUG_TREATMENT/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_drug_reviews_webmd_en_4.0.0_3.0_1659008484818.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_drug_reviews_webmd_en_4.0.0_3.0_1659008484818.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_drug_reviews_webmd", "en", "clinical/models")\
.setInputCols(["document",'token'])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
data = spark.createDataFrame(["While it has worked for me, the sweating and chills especially at night when trying to sleep are very off putting and I am not sure if I will stick with it very much longer. My eyese no longer feel like there is something in them and my mouth is definitely not as dry as before but the side effects are too invasive for my liking.",
"I previously used Cheratussin but was now dispensed Guaifenesin AC as a cheaper alternative. This stuff does n t work as good as Cheratussin and taste like cherry flavored sugar water."], StringType()).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("text", "class.result").show(truncate=False)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_drug_reviews_webmd", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))
val data = Seq("While it has worked for me, the sweating and chills especially at night when trying to sleep are very off putting and I am not sure if I will stick with it very much longer. My eyese no longer feel like there is something in them and my mouth is definitely not as dry as before but the side effects are too invasive for my liking.",
"I previously used Cheratussin but was now dispensed Guaifenesin AC as a cheaper alternative. This stuff does n t work as good as Cheratussin and taste like cherry flavored sugar water.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.drug_reviews").predict("""While it has worked for me, the sweating and chills especially at night when trying to sleep are very off putting and I am not sure if I will stick with it very much longer. My eyese no longer feel like there is something in them and my mouth is definitely not as dry as before but the side effects are too invasive for my liking.""")
```
## Results
```bash
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+
|text |result |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+
|While it has worked for me, the sweating and chills especially at night when trying to sleep are very off putting and I am not sure if I will stick with it very much longer. My eyese no longer feel like there is something in them and my mouth is definitely not as dry as before but the side effects are too invasive for my liking.|[negative]|
|I previously used Cheratussin but was now dispensed Guaifenesin AC as a cheaper alternative. This stuff does n t work as good as Cheratussin and taste like cherry flavored sugar water . |[positive]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_drug_reviews_webmd|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## Benchmarking
```bash
label precision recall f1-score support
negative 0.8589 0.8234 0.8408 1042
positive 0.8612 0.8901 0.8754 1283
accuracy - - 0.8602 2325
macro-avg 0.8600 0.8568 0.8581 2325
weighted-avg 0.8602 0.8602 0.8599 2325
```
---
layout: model
title: English BertForQuestionAnswering model (from peggyhuang)
author: John Snow Labs
name: bert_qa_nolog_SciBert_v2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `nolog-SciBert-v2` is an English model originally trained by `peggyhuang`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_nolog_SciBert_v2_en_4.0.0_3.0_1654188991457.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_nolog_SciBert_v2_en_4.0.0_3.0_1654188991457.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_nolog_SciBert_v2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_nolog_SciBert_v2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.scibert.v2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_nolog_SciBert_v2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|410.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/peggyhuang/nolog-SciBert-v2
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Filial)
author: John Snow Labs
name: distilbert_qa_filial_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Filial`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_filial_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768555427.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_filial_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768555427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_filial_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_filial_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_filial_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Filial/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Legal Waiver Of Subrogation Clause Binary Classifier
author: John Snow Labs
name: legclf_waiver_of_subrogation_clause
date: 2023-01-29
tags: [en, legal, classification, waiver, subrogation, clauses, waiver_of_subrogation, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `waiver-of-subrogation` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`waiver-of-subrogation`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_waiver_of_subrogation_clause_en_1.0.0_3.0_1675005652788.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_waiver_of_subrogation_clause_en_1.0.0_3.0_1675005652788.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
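This card does not include a code snippet, so the sketch below follows the pattern used by other Legal NLP clause-classifier cards. It is a minimal sketch, not taken from this card: the `sent_bert_base_cased` embeddings model and the column names are assumptions, and running it requires a licensed `johnsnowlabs` installation and an active Spark session.

```python
# Minimal usage sketch (assumed pipeline; requires johnsnowlabs license + Spark session)
from johnsnowlabs import nlp, legal
from pyspark.ml import Pipeline

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Sentence embeddings feeding the classifier; the exact embeddings model
# name is an assumption based on similar legclf cards.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained(
        "legclf_waiver_of_subrogation_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

data = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("class.result").show(truncate=False)
```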
## Results
```bash
+-----------------------+
|result                 |
+-----------------------+
|[waiver-of-subrogation]|
|[other]                |
|[other]                |
|[waiver-of-subrogation]|
+-----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_waiver_of_subrogation_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents, scraped from the Internet and classified in-house.
## Benchmarking
```bash
label precision recall f1-score support
other 0.95 0.97 0.96 39
waiver-of-subrogation 0.97 0.94 0.95 32
accuracy - - 0.96 71
macro-avg 0.96 0.96 0.96 71
weighted-avg 0.96 0.96 0.96 71
```
---
layout: model
title: Spanish BertForMaskedLM Base Uncased model (from dccuchile)
author: John Snow Labs
name: bert_embeddings_base_spanish_wwm_uncased
date: 2022-12-02
tags: [es, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: es
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-uncased` is a Spanish model originally trained by `dccuchile`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_spanish_wwm_uncased_es_4.2.4_3.0_1670018913004.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_spanish_wwm_uncased_es_4.2.4_3.0_1670018913004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_spanish_wwm_uncased","es") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_spanish_wwm_uncased","es")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_spanish_wwm_uncased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|es|
|Size:|412.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased
- https://github.com/google-research/bert
- https://github.com/josecannete/spanish-corpora
- https://github.com/google-research/bert/blob/master/multilingual.md
- https://users.dcc.uchile.cl/~jperez/beto/uncased_2M/tensorflow_weights.tar.gz
- https://users.dcc.uchile.cl/~jperez/beto/uncased_2M/pytorch_weights.tar.gz
- https://users.dcc.uchile.cl/~jperez/beto/cased_2M/tensorflow_weights.tar.gz
- https://users.dcc.uchile.cl/~jperez/beto/cased_2M/pytorch_weights.tar.gz
- https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1827
- https://www.kaggle.com/nltkdata/conll-corpora
- https://github.com/gchaperon/beto-benchmarks/blob/master/conll2002/dev_results_beto-cased_conll2002.txt
- https://github.com/facebookresearch/MLDoc
- https://github.com/gchaperon/beto-benchmarks/blob/master/MLDoc/dev_results_beto-cased_mldoc.txt
- https://github.com/gchaperon/beto-benchmarks/blob/master/MLDoc/dev_results_beto-uncased_mldoc.txt
- https://github.com/google-research-datasets/paws/tree/master/pawsx
- https://github.com/facebookresearch/XNLI
- https://colab.research.google.com/drive/1uRwg4UmPgYIqGYY4gW_Nsw9782GFJbPt
- https://www.adere.so/
- https://imfd.cl/en/
- https://www.tensorflow.org/tfrc
- https://users.dcc.uchile.cl/~jperez/papers/pml4dc2020.pdf
- https://github.com/google-research/bert/blob/master/multilingual.md
- https://arxiv.org/pdf/1904.09077.pdf
- https://arxiv.org/pdf/1906.01502.pdf
- https://arxiv.org/abs/1812.10464
- https://arxiv.org/pdf/1901.07291.pdf
- https://arxiv.org/pdf/1904.02099.pdf
- https://arxiv.org/pdf/1906.01569.pdf
- https://arxiv.org/abs/1908.11828
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from V3RX2000)
author: John Snow Labs
name: xlmroberta_ner_v3rx2000_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `V3RX2000`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_v3rx2000_base_finetuned_panx_de_4.1.0_3.0_1660430662562.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_v3rx2000_base_finetuned_panx_de_4.1.0_3.0_1660430662562.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_v3rx2000_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_v3rx2000_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_v3rx2000_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/V3RX2000/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: Pipeline to Summarize Clinical Notes in Laymen Terms
author: John Snow Labs
name: summarizer_clinical_laymen_pipeline
date: 2023-06-06
tags: [licensed, en, clinical, summarization, laymen_terms]
task: Summarization
language: en
edition: Healthcare NLP 4.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [summarizer_clinical_laymen](https://nlp.johnsnowlabs.com/2023/05/31/summarizer_clinical_laymen_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_laymen_pipeline_en_4.4.1_3.0_1686085843660.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_laymen_pipeline_en_4.4.1_3.0_1686085843660.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("summarizer_clinical_laymen_pipeline", "en", "clinical/models")
text = """
Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43. She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image. She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year. She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weight loss, and Dexatrim for one month in 2005 with a five-pound weight loss. She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss.
PAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath.
PAST SURGICAL HISTORY: Pertinent for cholecystectomy.
PSYCHOLOGICAL HISTORY: Negative.
SOCIAL HISTORY: She is single. She drinks alcohol once a week. She does not smoke.
FAMILY HISTORY: Pertinent for obesity and hypertension.
MEDICATIONS: Include Topamax 100 mg twice daily, Zoloft 100 mg twice daily, Abilify 5 mg daily, Motrin 800 mg daily, and a multivitamin.
ALLERGIES: She has no known drug allergies.
REVIEW OF SYSTEMS: Negative.
PHYSICAL EXAM: This is a pleasant female in no acute distress. Alert and oriented x 3. HEENT: Normocephalic, atraumatic. Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis.
ASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy. Once this is completed, we will submit her to her insurance company for approval.
"""
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("summarizer_clinical_laymen_pipeline", "en", "clinical/models")
val text = """
Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43. She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image. She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year. She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weight loss, and Dexatrim for one month in 2005 with a five-pound weight loss. She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss.
PAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath.
PAST SURGICAL HISTORY: Pertinent for cholecystectomy.
PSYCHOLOGICAL HISTORY: Negative.
SOCIAL HISTORY: She is single. She drinks alcohol once a week. She does not smoke.
FAMILY HISTORY: Pertinent for obesity and hypertension.
MEDICATIONS: Include Topamax 100 mg twice daily, Zoloft 100 mg twice daily, Abilify 5 mg daily, Motrin 800 mg daily, and a multivitamin.
ALLERGIES: She has no known drug allergies.
REVIEW OF SYSTEMS: Negative.
PHYSICAL EXAM: This is a pleasant female in no acute distress. Alert and oriented x 3. HEENT: Normocephalic, atraumatic. Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis.
ASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy. Once this is completed, we will submit her to her insurance company for approval.
"""
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
This is a clinical note about a 34-year-old woman who is interested in having weight loss surgery. She has been overweight for over 20 years and wants to have more energy and improve her self-image. She has tried many diets and weight loss programs, but has not been successful in keeping the weight off. She has a history of hypertension and shortness of breath, but is not allergic to any medications. She will have an upper endoscopy and will be contacted by a nutritionist and social worker. The plan is to have her weight loss surgery through the gastric bypass, rather than Lap-Band.
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|summarizer_clinical_laymen_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|937.2 MB|
## Included Models
- DocumentAssembler
- MedicalSummarizer
---
layout: model
title: English BertForQuestionAnswering Mini Cased model (from M-FAC)
author: John Snow Labs
name: bert_qa_mini_finetuned_squadv2
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-mini-finetuned-squadv2` is an English model originally trained by `M-FAC`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mini_finetuned_squadv2_en_4.0.0_3.0_1657187810266.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mini_finetuned_squadv2_en_4.0.0_3.0_1657187810266.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mini_finetuned_squadv2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mini_finetuned_squadv2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_mini_finetuned_squadv2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|42.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/M-FAC/bert-mini-finetuned-squadv2
- https://arxiv.org/pdf/2107.03356.pdf
- https://github.com/IST-DASLab/M-FAC
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_10
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-64-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_10_en_4.0.0_3.0_1657185376585.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_10_en_4.0.0_3.0_1657185376585.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_10","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_10","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_10|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-64-finetuned-squad-seed-10
---
layout: model
title: Financial Question Answering (RoBerta)
author: John Snow Labs
name: finqa_roberta
date: 2022-08-09
tags: [en, finance, qa, licensed]
task: Question Answering
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Financial RoBerta-based Question Answering model, trained on squad-v2 and fine-tuned on proprietary financial questions and answers.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finqa_roberta_en_1.0.0_3.2_1660054527812.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finqa_roberta_en_1.0.0_3.2_1660054527812.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
spanClassifier = nlp.RoBertaForQuestionAnswering.pretrained("finqa_roberta","en", "finance/models") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = nlp.Pipeline(stages=[documentAssembler, spanClassifier])
example = spark.createDataFrame([["What is the current total Operating Profit?", "Operating profit totaled EUR 9.4 mn , down from EUR 11.7 mn in 2004"]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
result.select('answer.result').show()
```
## Results
```bash
`9.4 mn , down from EUR 11.7`
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finqa_roberta|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|248.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
Trained on squad-v2 and fine-tuned on proprietary financial questions and answers.
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from tyqiangz)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_base_finetuned_chaii
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-chaii` is an English model originally trained by `tyqiangz`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_finetuned_chaii_en_4.0.0_3.0_1655989602942.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_finetuned_chaii_en_4.0.0_3.0_1655989602942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_finetuned_chaii","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlm_roberta_base_finetuned_chaii","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.chaii.xlm_roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
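The nlu one-liner above packs question and context into a single string, with `|||` as the separator between the two parts. A minimal sketch of splitting such a string back apart (an illustrative helper, not an nlu API):

```python
def split_qa(s, sep="|||"):
    # Split a combined "question|||context" string into its two parts.
    question, _, context = s.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
```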
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_base_finetuned_chaii|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|861.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/tyqiangz/xlm-roberta-base-finetuned-chaii
---
layout: model
title: Dutch Named Entity Recognition (from Davlan)
author: John Snow Labs
name: xlmroberta_ner_xlm_roberta_base_ner_hrl
date: 2022-05-17
tags: [xlm_roberta, ner, token_classification, nl, open_source]
task: Named Entity Recognition
language: nl
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-ner-hrl` is a Dutch model originally trained by `Davlan`.
## Predicted Entities
`PER`, `ORG`, `LOC`, `DATE`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_ner_hrl_nl_3.4.2_3.0_1652809603204.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_ner_hrl_nl_3.4.2_3.0_1652809603204.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_ner_hrl","nl") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Ik hou van Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_ner_hrl","nl")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Ik hou van Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
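The `ner` column holds token-level IOB-style tags; downstream code (e.g. Spark NLP's NerConverter) collapses them into entity chunks. A pure-Python sketch of that grouping, over hypothetical tags:

```python
def group_entities(tokens, tags):
    # Collapse IOB tags (B-XXX / I-XXX / O) into (chunk_text, label) pairs,
    # mirroring what NerConverter does with the "ner" column.
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-"):
            current.append(tok)
        else:  # "O" ends any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Ik", "hou", "van", "Spark", "NLP"]
tags = ["O", "O", "O", "B-ORG", "I-ORG"]  # hypothetical model output
print(group_entities(tokens, tags))  # [('Spark NLP', 'ORG')]
```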
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_xlm_roberta_base_ner_hrl|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|nl|
|Size:|855.9 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/Davlan/xlm-roberta-base-ner-hrl
- https://camel.abudhabi.nyu.edu/anercorp/
- https://www.clips.uantwerpen.be/conll2003/ner/
- https://www.clips.uantwerpen.be/conll2002/ner/
---
layout: model
title: Recognize Entities DL pipeline for English - BERT
author: John Snow Labs
name: recognize_entities_bert
date: 2021-03-23
tags: [open_source, english, recognize_entities_bert, pipeline, en]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: en
nav_key: models
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The recognize_entities_bert is a pretrained pipeline that performs basic text processing steps and recognizes entities.
It covers most of the common text processing tasks on your dataframe.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/recognize_entities_bert_en_3.0.0_3.0_1616473903583.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/recognize_entities_bert_en_3.0.0_3.0_1616473903583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('recognize_entities_bert', lang = 'en')
annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("recognize_entities_bert", lang = "en")
val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs ! "]
result_df = nlu.load('en.ner.bert').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | embeddings | ner | entities |
|---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:-------------------------------------------|:-------------------|
| 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[-0.085488274693489,.,...]] | ['O', 'O', 'I-PER', 'I-PER', 'I-ORG', 'O'] | ['John Snow Labs'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|recognize_entities_bert|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
---
layout: model
title: Fast Neural Machine Translation Model from Kirundi to English
author: John Snow Labs
name: opus_mt_rn_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, rn, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `rn`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_rn_en_xx_2.7.0_2.4_1609167024774.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_rn_en_xx_2.7.0_2.4_1609167024774.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_rn_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_rn_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.rn.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
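Marian translates sentence by sentence, which is why a sentence detector precedes it in the pipelines above. A rough stand-in for that step (a naive regex splitter, far weaker than SentenceDetectorDLModel, shown only to illustrate the shape of the intermediate `sentence` column):

```python
import re

def naive_split(text):
    # Split on ., ! or ? followed by whitespace; each piece becomes one
    # "sentence" annotation fed to the translator.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_split("Hello world. How are you?"))  # ['Hello world.', 'How are you?']
```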
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_rn_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from juliusco)
author: John Snow Labs
name: distilbert_qa_juliusco_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `juliusco`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_juliusco_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771577414.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_juliusco_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771577414.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_juliusco_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_juliusco_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
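This checkpoint is uncased, and `setCaseSensitive` controls whether the annotator lowercases input before tokenization. A minimal sketch of the effect of that flag (an illustrative helper, not the Spark NLP implementation):

```python
def preprocess(text, case_sensitive):
    # With case sensitivity off, uncased checkpoints see lowercased text,
    # so "Clara" and "clara" map to the same tokens.
    return text if case_sensitive else text.lower()

print(preprocess("My name is Clara", case_sensitive=False))  # my name is clara
```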
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_juliusco_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/juliusco/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265902
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265902` is an English model originally trained by `teacookies`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265902_en_4.0.0_3.0_1655984889587.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265902_en_4.0.0_3.0_1655984889587.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265902","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265902","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265902").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265902|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|888.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265902
---
layout: model
title: Detect Assertion Status (assertion_wip_large)
author: John Snow Labs
name: jsl_assertion_wip_large
date: 2021-01-18
task: Assertion Status
language: en
nav_key: models
edition: Healthcare NLP 2.7.0
spark_version: 2.4
tags: [clinical, licensed, assertion, en, ner]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The deep neural network architecture for assertion status detection in Spark NLP is based on a BiLSTM framework, and is a modified version of the architecture proposed by Fancellu et al. (Fancellu, Lopez, and Webber 2016). Its goal is to classify the assertions made on given medical concepts as present, absent, or possible in the patient, conditionally present under certain circumstances, hypothetically present at some future point, or mentioned in the patient report but associated with someone else (Uzuner et al. 2011). In addition to the data used in our other assertion models, in-house annotations on a curated dataset (6K clinical notes) were used to augment the base assertion dataset (2010 i2b2/VA).
{:.h2_title}
## Predicted Entities
`present`, `absent`, `possible`, `planned`, `someoneelse`, `past`.
{:.btn-box}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_assertion_wip_large_en_2.6.5_2.4_1609091911183.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_assertion_wip_large_en_2.6.5_2.4_1609091911183.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter, AssertionDLModel.
{% include programmingLanguageSelectScalaPython.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
clinical_assertion = AssertionDLModel.pretrained("jsl_assertion_wip_large", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion])
light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val clinical_assertion = AssertionDLModel.pretrained("jsl_assertion_wip_large", "en", "clinical/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings"))
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, nerConverter, clinical_assertion))
val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
The output is a dataframe with a sentence per row and an ``"assertion"`` column containing all of the assertion labels in the sentence. The assertion column also contains assertion character indices, and other metadata. To get only the entity chunks and assertion labels, without the metadata, select ``"ner_chunk.result"`` and ``"assertion.result"`` from your output dataframe.
```bash
+-----------------------------------------+-----+---+----------------------------+-------+-----------+
|chunk |begin|end|ner_label |sent_id|assertion |
+-----------------------------------------+-----+---+----------------------------+-------+-----------+
|21-day-old |17 |26 |Age |0 |present |
|Caucasian |28 |36 |Race_Ethnicity |0 |present |
|male |38 |41 |Gender |0 |someoneelse|
|for 2 days |48 |57 |Duration |0 |present |
|congestion |62 |71 |Symptom |0 |present |
|mom |75 |77 |Gender |0 |someoneelse|
|yellow |99 |104|Modifier |0 |present |
|discharge |106 |114|Symptom |0 |present |
|nares |135 |139|External_body_part_or_region|0 |someoneelse|
|she |147 |149|Gender |0 |present |
|mild |168 |171|Modifier |0 |present |
|problems with his breathing while feeding|173 |213|Symptom |0 |present |
|perioral cyanosis |237 |253|Symptom |0 |absent |
|retractions |258 |268|Symptom |0 |absent |
|One day ago |272 |282|RelativeDate |1 |someoneelse|
|mom |285 |287|Gender |1 |someoneelse|
|Tylenol |345 |351|Drug_BrandName |1 |someoneelse|
|Baby |354 |357|Age |2 |someoneelse|
|decreased p.o. intake |377 |397|Symptom |2 |someoneelse|
|His |400 |402|Gender |3 |someoneelse|
+-----------------------------------------+-----+---+----------------------------+-------+-----------+
```
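A common follow-up is to keep only the chunks carrying a particular assertion status. A small sketch over rows shaped like the table above (an illustrative helper, not a Spark NLP API):

```python
# (chunk, ner_label, assertion) triples, taken from the sample output above.
rows = [
    ("congestion", "Symptom", "present"),
    ("perioral cyanosis", "Symptom", "absent"),
    ("retractions", "Symptom", "absent"),
    ("Tylenol", "Drug_BrandName", "someoneelse"),
]

def chunks_with_status(rows, status):
    # Return the chunk texts whose assertion label matches `status`.
    return [chunk for chunk, _, assertion in rows if assertion == status]

print(chunks_with_status(rows, "absent"))  # ['perioral cyanosis', 'retractions']
```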
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|jsl_assertion_wip_large|
|Type:|ner|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence, ner_chunk, embeddings]|
|Output Labels:|[assertion]|
|Language:|[en]|
|Case sensitive:|false|
{:.h2_title}
## Data Source
Trained on 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text with 'embeddings_clinical'.
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
{:.h2_title}
## Benchmarking
```bash
label prec rec f1
absent 0.957 0.949 0.953
someoneelse 0.958 0.936 0.947
planned 0.766 0.657 0.707
possible 0.852 0.884 0.868
past 0.894 0.890 0.892
present 0.902 0.917 0.910
Macro-average 0.888 0.872 0.880
Micro-average 0.908 0.908 0.908
```
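The Macro-average row above is the unweighted mean of the per-label scores; a quick check:

```python
# Per-label (precision, recall, f1) from the benchmark table above.
scores = [
    (0.957, 0.949, 0.953),  # absent
    (0.958, 0.936, 0.947),  # someoneelse
    (0.766, 0.657, 0.707),  # planned
    (0.852, 0.884, 0.868),  # possible
    (0.894, 0.890, 0.892),  # past
    (0.902, 0.917, 0.910),  # present
]
# Unweighted column-wise mean; agrees with the reported Macro-average
# row (0.888, 0.872, 0.880) at the table's 3-decimal precision.
macro = [sum(col) / len(scores) for col in zip(*scores)]
print([round(m, 3) for m in macro])
```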
---
layout: model
title: English T5ForConditionalGeneration Large Cased model (from google)
author: John Snow Labs
name: t5_efficient_large_nl10
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-large-nl10` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nl10_en_4.3.0_3.0_1675116839759.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nl10_en_4.3.0_3.0_1675116839759.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_large_nl10","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_large_nl10","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
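T5 produces the `answers` column autoregressively, emitting one token at a time. A toy illustration of greedy decoding, with a made-up next-token table standing in for the real model (which conditions on the full prefix and the encoder output):

```python
# Hypothetical next-token scores keyed only by the previous token.
next_token = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "sat": {"</s>": 1.0},
}

def greedy_decode(start="<s>", max_len=10):
    # Repeatedly pick the highest-scoring next token until </s> or max_len.
    out, tok = [], start
    for _ in range(max_len):
        tok = max(next_token[tok], key=next_token[tok].get)
        if tok == "</s>":
            break
        out.append(tok)
    return " ".join(out)

print(greedy_decode())  # the cat sat
```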
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_large_nl10|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|689.4 MB|
## References
- https://huggingface.co/google/t5-efficient-large-nl10
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Albanian BertForQuestionAnswering model (from vanadhi)
author: John Snow Labs
name: bert_qa_bert_base_uncased_fiqa_flm_sq_flit
date: 2022-06-02
tags: [open_source, question_answering, bert]
task: Question Answering
language: sq
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-fiqa-flm-sq-flit` is an Albanian model originally trained by `vanadhi`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_fiqa_flm_sq_flit_sq_4.0.0_3.0_1654181262795.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_fiqa_flm_sq_flit_sq_4.0.0_3.0_1654181262795.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_fiqa_flm_sq_flit","sq") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_uncased_fiqa_flm_sq_flit","sq")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("sq.answer_question.bert.base_uncased.by_vanadhi").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_uncased_fiqa_flm_sq_flit|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|sq|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/vanadhi/bert-base-uncased-fiqa-flm-sq-flit
- https://drive.google.com/file/d/1BlWaV-qVPfpGyJoWQJU9bXQgWCATgxEP/view
---
layout: model
title: Spanish T5ForConditionalGeneration Cased model (from JorgeSarry)
author: John Snow Labs
name: t5_est5base
date: 2023-01-30
tags: [es, open_source, t5, tensorflow]
task: Text Generation
language: es
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `est5base` is a Spanish model originally trained by `JorgeSarry`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_est5base_es_4.3.0_3.0_1675101719578.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_est5base_es_4.3.0_3.0_1675101719578.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_est5base","es") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_est5base","es")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_est5base|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|es|
|Size:|511.9 MB|
## References
- https://huggingface.co/JorgeSarry/est5base
- https://towardsdatascience.com/how-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90
---
layout: model
title: English BertForQuestionAnswering Large Cased model (from Slavka)
author: John Snow Labs
name: bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespace_larg
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-finetuned-log-parser-winlogbeat_nowhitespace_large` is an English model originally trained by `Slavka`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespace_larg_en_4.0.0_3.0_1657182801327.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespace_larg_en_4.0.0_3.0_1657182801327.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespace_larg","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespace_larg","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespace_larg|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Slavka/bert-base-cased-finetuned-log-parser-winlogbeat_nowhitespace_large
---
layout: model
title: Japanese BertForMaskedLM Base Cased model (from cl-tohoku)
author: John Snow Labs
name: bert_embeddings_base_japanese_char_whole_word_masking
date: 2022-12-02
tags: [ja, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: ja
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-japanese-char-whole-word-masking` is a Japanese model originally trained by `cl-tohoku`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_japanese_char_whole_word_masking_ja_4.2.4_3.0_1670018214237.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_japanese_char_whole_word_masking_ja_4.2.4_3.0_1670018214237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_japanese_char_whole_word_masking","ja") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_japanese_char_whole_word_masking","ja")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_japanese_char_whole_word_masking|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ja|
|Size:|334.3 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/cl-tohoku/bert-base-japanese-char-whole-word-masking
- https://github.com/google-research/bert
- https://github.com/cl-tohoku/bert-japanese/tree/v1.0
- https://github.com/attardi/wikiextractor
- https://taku910.github.io/mecab/
- https://creativecommons.org/licenses/by-sa/3.0/
- https://www.tensorflow.org/tfrc/
---
layout: model
title: English image_classifier_vit_my_bean_VIT ViTForImageClassification from woojinSong
author: John Snow Labs
name: image_classifier_vit_my_bean_VIT
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_my_bean_VIT` is an English model originally trained by woojinSong.
## Predicted Entities
`angular_leaf_spot`, `bean_rust`, `healthy`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_my_bean_VIT_en_4.1.0_3.0_1660167368304.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_my_bean_VIT_en_4.1.0_3.0_1660167368304.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_my_bean_VIT", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_my_bean_VIT", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_my_bean_VIT|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Word2Vec Embeddings in Sinhala (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, si, open_source]
task: Embeddings
language: si
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_si_3.4.1_3.0_1647457358445.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_si_3.4.1_3.0_1647457358445.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","si") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["මම ස්පර්ක් එන්එල්පී වලට කැමතියි"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","si")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("මම ස්පර්ක් එන්එල්පී වලට කැමතියි").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("si.embed.w2v_cc_300d").predict("""මම ස්පර්ක් එන්එල්පී වලට කැමතියි""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|si|
|Size:|471.2 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18)
author: John Snow Labs
name: roberta_qa_base_bne_becas
date: 2023-01-20
tags: [es, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: es
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-ROBERTaBECAS` is a Spanish model originally trained by `Evelyn18`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_bne_becas_es_4.3.0_3.0_1674212894934.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_bne_becas_es_4.3.0_3.0_1674212894934.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_bne_becas","es")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_bne_becas","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_bne_becas|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|420.6 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Evelyn18/roberta-base-bne-ROBERTaBECAS
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_by_doddle124578 TFWav2Vec2ForCTC from doddle124578
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_doddle124578
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_doddle124578` is an English model originally trained by doddle124578.
NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_doddle124578_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_doddle124578_en_4.2.0_3.0_1664036115160.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_doddle124578_en_4.2.0_3.0_1664036115160.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_doddle124578', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_doddle124578", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_doddle124578|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Extract Granular Anatomical Entities from Oncology Texts
author: John Snow Labs
name: ner_oncology_anatomy_granular_wip
date: 2022-10-01
tags: [licensed, clinical, oncology, en, ner, anatomy]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts mentions of anatomical entities using granular labels.
Definitions of Predicted Entities:
- `Direction`: Directional and laterality terms, such as "left", "right", "bilateral", "upper" and "lower".
- `Site_Bone`: Anatomical terms that refer to the human skeleton.
- `Site_Brain`: Anatomical terms that refer to the central nervous system (including the brain stem and the cerebellum).
- `Site_Breast`: Anatomical terms that refer to the breasts.
- `Site_Liver`: Anatomical terms that refer to the liver.
- `Site_Lung`: Anatomical terms that refer to the lungs.
- `Site_Lymph_Node`: Anatomical terms that refer to lymph nodes, excluding adenopathies.
- `Site_Other_Body_Part`: Relevant anatomical terms that are not included in the rest of the anatomical entities.
## Predicted Entities
`Direction`, `Site_Bone`, `Site_Brain`, `Site_Breast`, `Site_Liver`, `Site_Lung`, `Site_Lymph_Node`, `Site_Other_Body_Part`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_granular_wip_en_4.0.0_3.0_1664584284877.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_granular_wip_en_4.0.0_3.0_1664584284877.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_anatomy_granular_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_anatomy_granular_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_anatomy_granular_wip").predict("""The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.""")
```
## Results
```bash
| chunk | ner_label |
|:--------|:------------|
| left | Direction |
| breast | Site_Breast |
| lungs | Site_Lung |
| liver | Site_Liver |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_anatomy_granular_wip|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|859.9 KB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Direction 601.0 150.0 133.0 734.0 0.80 0.82 0.81
Site_Lymph_Node 415.0 31.0 51.0 466.0 0.93 0.89 0.91
Site_Breast 98.0 6.0 20.0 118.0 0.94 0.83 0.88
Site_Other_Body_Part 713.0 277.0 388.0 1101.0 0.72 0.65 0.68
Site_Bone 176.0 30.0 56.0 232.0 0.85 0.76 0.80
Site_Liver 134.0 77.0 36.0 170.0 0.64 0.79 0.70
Site_Lung 337.0 70.0 106.0 443.0 0.83 0.76 0.79
Site_Brain 164.0 58.0 36.0 200.0 0.74 0.82 0.78
macro_avg 2638.0 699.0 826.0 3464.0 0.81 0.79 0.80
micro_avg NaN NaN NaN NaN 0.79 0.76 0.78
```
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from edwardjross)
author: John Snow Labs
name: xlmroberta_ner_edwardjross_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `edwardjross`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_edwardjross_base_finetuned_panx_de_4.1.0_3.0_1660432506605.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_edwardjross_base_finetuned_panx_de_4.1.0_3.0_1660432506605.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCols(["text"]) \
.setOutputCols("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_edwardjross_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCols(Array("text"))
.setOutputCols(Array("document"))
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_edwardjross_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_edwardjross_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/edwardjross/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: Legal Name Clause Binary Classifier
author: John Snow Labs
name: legclf_name_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `name` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
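As a minimal, illustrative sketch of the first technique (the clause text below is invented for the example), paragraph splitting by multiline can be done in plain Python before the chunks are fed to the classifier:

```python
import re

def split_paragraphs(text):
    # Split on runs of one or more blank lines and drop empty chunks
    chunks = re.split(r"\n\s*\n", text)
    return [c.strip() for c in chunks if c.strip()]

contract = "1. NAME.\nThe name of the Company is Acme Holdings LLC.\n\n2. PURPOSE.\nThe Company is organized to engage in any lawful activity."
paragraphs = split_paragraphs(contract)
# Each paragraph can now be classified independently by the model
```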
Keep in mind that this model's embeddings allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `name`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_name_clause_en_1.0.0_3.2_1660122668905.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_name_clause_en_1.0.0_3.2_1660122668905.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
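This card is missing its usage snippet; below is a minimal Python sketch following the usual legclf pattern (DocumentAssembler, sentence embeddings, classifier reading `sentence_embeddings` and writing `category`, matching the labels in the table below). The embeddings model name is an assumption, since the card does not state which embeddings this classifier was trained with:

```python
# NOTE: "sent_bert_base_cased" is an assumed embeddings model; check the
# Models Hub entry for the embeddings this classifier was actually trained with.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = ClassifierDLModel.pretrained("legclf_name_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])
data = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```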
## Results
```bash
+-------+
| result|
+-------+
| [name]|
|[other]|
|[other]|
| [name]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_name_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|
## References
Legal documents scraped from the Internet and classified in-house.
## Benchmarking
```bash
label precision recall f1-score support
name 0.98 0.91 0.94 46
other 0.97 0.99 0.98 148
accuracy - - 0.97 194
macro-avg 0.98 0.95 0.96 194
weighted-avg 0.97 0.97 0.97 194
```
---
layout: model
title: Spanish Deberta Embeddings model (from plncmm)
author: John Snow Labs
name: deberta_embeddings_cowese_base
date: 2023-03-12
tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, es, tensorflow]
task: Embeddings
language: es
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DeBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mdeberta-cowese-base-es` is a Spanish model originally trained by `plncmm`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_cowese_base_es_4.3.1_3.0_1678657528702.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_cowese_base_es_4.3.1_3.0_1678657528702.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_cowese_base","es") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_cowese_base","es")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|deberta_embeddings_cowese_base|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|es|
|Size:|1.0 GB|
|Case sensitive:|false|
## References
- https://huggingface.co/plncmm/mdeberta-cowese-base-es
---
layout: model
title: Fast Neural Machine Translation Model from English to Ga
author: John Snow Labs
name: opus_mt_en_gaa
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, gaa, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `gaa`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_gaa_xx_2.7.0_2.4_1609169994333.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_gaa_xx_2.7.0_2.4_1609169994333.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_gaa", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_gaa", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.gaa').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_gaa|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English XLMRobertaForTokenClassification Base Cased model (from tner)
author: John Snow Labs
name: xlmroberta_ner_base_bionlp2004
date: 2022-08-13
tags: [en, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-bionlp2004` is an English model originally trained by `tner`.
## Predicted Entities
`protein`, `dna`, `cell line`, `rna`, `cell type`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_bionlp2004_en_4.1.0_3.0_1660426076485.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_bionlp2004_en_4.1.0_3.0_1660426076485.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_bionlp2004","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_bionlp2004","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_bionlp2004|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|783.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/tner/xlm-roberta-base-bionlp2004
- https://github.com/asahi417/tner
---
layout: model
title: Detect Problems, Tests and Treatments (ner_clinical_large)
author: John Snow Labs
name: ner_clinical_large
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for clinical terms. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs".
## Predicted Entities
`PROBLEM`, `TEST`, `TREATMENT`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_EVENTS_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_large_en_3.0.0_3.0_1617206114650.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_large_en_3.0.0_3.0_1617206114650.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([['The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.']], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models")
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))
val data = Seq("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+-----------------------------------------------------------+---------+
|chunk |ner_label|
+-----------------------------------------------------------+---------+
|the G-protein-activated inwardly rectifying potassium (GIRK|TREATMENT|
|the genomicorganization |TREATMENT|
|a candidate gene forType II diabetes mellitus |PROBLEM |
|byapproximately |TREATMENT|
|single nucleotide polymorphisms |TREATMENT|
|aVal366Ala substitution |TREATMENT|
|an 8 base-pair |TREATMENT|
|insertion/deletion |PROBLEM |
|Ourexpression studies |TEST |
|the transcript in various humantissues |PROBLEM |
|fat andskeletal muscle |PROBLEM |
|furtherstudies |PROBLEM |
|the KCNJ9 protein |TREATMENT|
|evaluation |TEST |
|Type II diabetes |PROBLEM |
|the treatment |TREATMENT|
|breast cancer |PROBLEM |
|the standard therapy |TREATMENT|
|anthracyclines |TREATMENT|
|taxanes |TREATMENT|
+-----------------------------------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_clinical_large|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Data Source
Trained on augmented 2010 i2b2 challenge data with `embeddings_clinical`.
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
## Benchmarking
```bash
| | label | tp | fp | fn | prec | rec | f1 |
|---:|--------------:|------:|------:|------:|---------:|---------:|---------:|
| 0 | I-TREATMENT | 6625 | 1187 | 1329 | 0.848054 | 0.832914 | 0.840416 |
| 1 | I-PROBLEM | 15142 | 1976 | 2542 | 0.884566 | 0.856254 | 0.87018 |
| 2 | B-PROBLEM | 11005 | 1065 | 1587 | 0.911765 | 0.873968 | 0.892466 |
| 3 | I-TEST | 6748 | 923 | 1264 | 0.879677 | 0.842237 | 0.86055 |
| 4 | B-TEST | 8196 | 942 | 1029 | 0.896914 | 0.888455 | 0.892665 |
| 5 | B-TREATMENT | 8271 | 1265 | 1073 | 0.867345 | 0.885167 | 0.876165 |
| 6 | Macro-average | 55987 | 7358 | 8824 | 0.881387 | 0.863166 | 0.872181 |
| 7 | Micro-average | 55987 | 7358 | 8824 | 0.883842 | 0.86385 | 0.873732 |
```
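The precision, recall, and F1 columns follow the standard definitions over the tp/fp/fn counts. As a quick sanity check (a small sketch, not part of the original card), the micro-average row can be reproduced directly from the aggregate counts:

```python
# Reproduce the micro-average row of the benchmarking table from tp/fp/fn.
tp, fp, fn = 55987, 7358, 8824

prec = tp / (tp + fp)                 # precision = tp / (tp + fp)
rec = tp / (tp + fn)                  # recall    = tp / (tp + fn)
f1 = 2 * prec * rec / (prec + rec)    # harmonic mean of precision and recall

print(round(prec, 6), round(rec, 6), round(f1, 6))  # matches 0.883842, 0.86385, 0.873732
```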
---
layout: model
title: Legal Undertaking for costs Clause Binary Classifier
author: John Snow Labs
name: legclf_undertaking_for_costs_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `undertaking-for-costs` clause type. To use this model, make sure you provide enough context as input. Adding a sentence splitter to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip it unless you want to run binary classification at the sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Keep in mind that the embeddings used by this model allow up to 512 tokens. If your input is longer, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of legal clause classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you add.
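The paragraph-splitting option mentioned above can be sketched in plain Python (a minimal illustration of multiline splitting, independent of Spark NLP; the sample document is hypothetical):

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Split a document into paragraphs on blank lines (simple multiline splitting)."""
    return [p.strip() for p in re.split(r"\n\s*\n+", text) if p.strip()]

# Hypothetical two-clause document; each paragraph can then be classified independently.
doc = "UNDERTAKING FOR COSTS.\nThe claimant undertakes to pay costs...\n\nGOVERNING LAW.\nThis agreement is governed by..."
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 2
```

Each resulting paragraph is short enough to stay under typical embedding limits and can be fed to the classifier as an independent input.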
## Predicted Entities
`other`, `undertaking-for-costs`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_undertaking_for_costs_clause_en_1.0.0_3.2_1660123148146.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_undertaking_for_costs_clause_en_1.0.0_3.2_1660123148146.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-----------------------+
|                 result|
+-----------------------+
|[undertaking-for-costs]|
|                [other]|
|                [other]|
|[undertaking-for-costs]|
+-----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_undertaking_for_costs_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.98 0.98 0.98 44
undertaking-for-costs 0.91 0.91 0.91 11
accuracy - - 0.96 55
macro-avg 0.94 0.94 0.94 55
weighted-avg 0.96 0.96 0.96 55
```
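The macro and weighted averages in the table aggregate the per-class F1 scores differently: macro averages the classes equally, while weighted scales each class by its support. A quick check against the reported values (which the table computes from unrounded per-class scores):

```python
# Per-class (f1, support) as reported in the benchmarking table above.
per_class = {"other": (0.98, 44), "undertaking-for-costs": (0.91, 11)}

macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)
total = sum(n for _, n in per_class.values())
weighted_f1 = sum(f1 * n for f1, n in per_class.values()) / total

print(macro_f1, weighted_f1)  # ~0.945 and ~0.966, matching the 0.94 / 0.96 rows within rounding
```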
---
layout: model
title: French CamemBert Embeddings (from aliasdasd)
author: John Snow Labs
name: camembert_embeddings_aliasdasd_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `aliasdasd`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_aliasdasd_generic_model_fr_3.4.4_3.0_1653987389063.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_aliasdasd_generic_model_fr_3.4.4_3.0_1653987389063.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_aliasdasd_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_aliasdasd_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_aliasdasd_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/aliasdasd/dummy-model
---
layout: model
title: Pipeline to Summarize Radiology Reports
author: John Snow Labs
name: summarizer_radiology_pipeline
date: 2023-05-29
tags: [licensed, en, clinical, summarization, radiology]
task: Summarization
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [summarizer_radiology](https://nlp.johnsnowlabs.com/2023/04/23/summarizer_jsl_radiology_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_radiology_pipeline_en_4.4.2_3.0_1685401622765.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_radiology_pipeline_en_4.4.2_3.0_1685401622765.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("summarizer_radiology_pipeline", "en", "clinical/models")
text = """INDICATIONS: Peripheral vascular disease with claudication.
RIGHT:
1. Normal arterial imaging of right lower extremity.
2. Peak systolic velocity is normal.
3. Arterial waveform is triphasic.
4. Ankle brachial index is 0.96.
LEFT:
1. Normal arterial imaging of left lower extremity.
2. Peak systolic velocity is normal.
3. Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic.
4. Ankle brachial index is 1.06.
IMPRESSION:
Normal arterial imaging of both lower lobes.
"""
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("summarizer_radiology_pipeline", "en", "clinical/models")
val text = """INDICATIONS: Peripheral vascular disease with claudication.
RIGHT:
1. Normal arterial imaging of right lower extremity.
2. Peak systolic velocity is normal.
3. Arterial waveform is triphasic.
4. Ankle brachial index is 0.96.
LEFT:
1. Normal arterial imaging of left lower extremity.
2. Peak systolic velocity is normal.
3. Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic.
4. Ankle brachial index is 1.06.
IMPRESSION:
Normal arterial imaging of both lower lobes.
"""
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
The patient has peripheral vascular disease with claudication. The right lower extremity shows normal arterial imaging, but the peak systolic velocity is normal. The arterial waveform is triphasic throughout, except for the posterior tibial artery, which is biphasic. The ankle brachial index is 0.96. The impression is normal arterial imaging of both lower lobes.
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|summarizer_radiology_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|937.2 MB|
## Included Models
- DocumentAssembler
- MedicalSummarizer
---
layout: model
title: Multilingual XLMRobertaForTokenClassification Large Cased model (from oliverguhr)
author: John Snow Labs
name: xlmroberta_ner_fullstop_punctuation_multilang_larg
date: 2022-08-01
tags: [xx, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: xx
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fullstop-punctuation-multilang-large` is a multilingual model originally trained by `oliverguhr`.
## Predicted Entities
`?`, `:`, `,`, `-`, `0`, `.`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_fullstop_punctuation_multilang_larg_xx_4.1.0_3.0_1659353733692.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_fullstop_punctuation_multilang_larg_xx_4.1.0_3.0_1659353733692.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_fullstop_punctuation_multilang_larg","xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_fullstop_punctuation_multilang_larg","xx")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_fullstop_punctuation_multilang_larg|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|xx|
|Size:|1.8 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large
---
layout: model
title: French CamemBert Embeddings (from wangst)
author: John Snow Labs
name: camembert_embeddings_wangst_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `wangst`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_wangst_generic_model_fr_3.4.4_3.0_1653990612908.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_wangst_generic_model_fr_3.4.4_3.0_1653990612908.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_wangst_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_wangst_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_wangst_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/wangst/dummy-model
---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_6
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-512-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_6_en_4.0.0_3.0_1655733114451.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_6_en_4.0.0_3.0_1655733114451.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_6","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_6","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_512d_seed_6").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_6|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|432.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-512-finetuned-squad-seed-6
---
layout: model
title: English BertForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_4
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-512-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_4_en_4.0.0_3.0_1657192672919.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_4_en_4.0.0_3.0_1657192672919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_4","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_4","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
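The span-extraction step these QA annotators perform can be pictured as follows: the model scores every token as a candidate answer start and as a candidate answer end, and the highest-scoring valid pair (start before end) becomes the answer. The function below is an illustrative toy in plain Python, not Spark NLP's actual implementation; the `max_len` cap is an assumption for the sketch.

```python
# Toy sketch of extractive QA span selection: choose the (start, end)
# token pair with the highest combined score, requiring start <= end.
# Illustrative only -- the real annotator works on transformer logits.
def best_span(start_scores, end_scores, max_len=15):
    best, span = float("-inf"), (0, 0)
    for i, s in enumerate(start_scores):
        # only consider answer spans of up to max_len tokens
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best:
                best, span = s + end_scores[j], (i, j)
    return span
```

For the example rows above, the context tokens between the chosen start and end positions would form the `answer` annotation.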
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_4|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|387.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-512-finetuned-squad-seed-4
---
layout: model
title: Longformer Large NER Pipeline
author: John Snow Labs
name: longformer_large_token_classifier_conll03_pipeline
date: 2022-06-19
tags: [open_source, ner, token_classifier, longformer, conll, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [longformer_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/10/09/longformer_large_token_classifier_conll03_en.html) model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655653984745.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655653984745.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("longformer_large_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I am working at John Snow Labs.")
```
```scala
val pipeline = new PretrainedPipeline("longformer_large_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I am working at John Snow Labs.")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|John |PER |
|John Snow Labs|ORG |
+--------------+---------+
```
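The `chunk`/`ner_label` pairs in the table come from merging token-level IOB tags into chunks, which is the job of the NerConverter stage included in this pipeline. A minimal sketch of that merge, assuming standard `B-`/`I-`/`O` tagging (not the annotator's real code):

```python
# Merge token-level IOB tags into (chunk, label) pairs, as NerConverter
# does conceptually. Illustrative sketch, not the actual annotator code.
def iob_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)           # entity continues
        else:                             # O tag closes any open entity
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks
```

Applied to the tagged tokens of the sample sentence, this yields exactly the two rows shown in the results table.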
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|longformer_large_token_classifier_conll03_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.5 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- LongformerForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: Spanish RobertaForQuestionAnswering (from stevemobs)
author: John Snow Labs
name: roberta_qa_roberta_large_fine_tuned_squad_es_stevemobs
date: 2022-06-21
tags: [es, open_source, question_answering, roberta]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-fine-tuned-squad-es` is a Spanish model originally trained by `stevemobs`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_fine_tuned_squad_es_stevemobs_es_4.0.0_3.0_1655791128840.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_fine_tuned_squad_es_stevemobs_es_4.0.0_3.0_1655791128840.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_fine_tuned_squad_es_stevemobs","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_large_fine_tuned_squad_es_stevemobs","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.squad.roberta.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_large_fine_tuned_squad_es_stevemobs|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/stevemobs/roberta-large-fine-tuned-squad-es
---
layout: model
title: English asr_wav2vec2_large_xlsr_moroccan TFWav2Vec2ForCTC from othrif
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_moroccan
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_moroccan` is an English model originally trained by othrif.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_moroccan_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_moroccan_en_4.2.0_3.0_1664097975990.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_moroccan_en_4.2.0_3.0_1664097975990.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xlsr_moroccan", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xlsr_moroccan", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
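Both snippets above assume an `audioDf` whose `audio_content` column holds arrays of floats. One way to produce such floats from a mono 16-bit PCM WAV file using only the Python standard library is sketched below; this is an assumption-laden helper, not part of Spark NLP, and the file's sample rate must match what the model expects (typically 16 kHz for Wav2Vec2).

```python
import struct
import wave

# Decode a mono 16-bit PCM WAV file into a list of floats in [-1.0, 1.0),
# the kind of array AudioAssembler's "audio_content" column expects.
def wav_to_floats(path):
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]  # scale int16 to [-1.0, 1.0)
```

`audioDf` could then be built with something like `spark.createDataFrame([(wav_to_floats("sample.wav"),)], ["audio_content"])` (hypothetical file name).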
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xlsr_moroccan|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Fast Neural Machine Translation Model from Umbundu to English
author: John Snow Labs
name: opus_mt_umb_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, umb, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `umb`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_umb_en_xx_2.7.0_2.4_1609169540084.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_umb_en_xx_2.7.0_2.4_1609169540084.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_umb_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_umb_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.umb.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_umb_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Fast Neural Machine Translation Model from Isoko to English
author: John Snow Labs
name: opus_mt_iso_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, iso, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `iso`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_iso_en_xx_2.7.0_2.4_1609169073963.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_iso_en_xx_2.7.0_2.4_1609169073963.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_iso_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_iso_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.iso.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_iso_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English asr_models_6 TFWav2Vec2ForCTC from niclas
author: John Snow Labs
name: pipeline_asr_models_6
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_models_6` is an English model originally trained by niclas.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_models_6_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_models_6_en_4.2.0_3.0_1664098783377.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_models_6_en_4.2.0_3.0_1664098783377.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_models_6', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_models_6", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_models_6|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English image_classifier_vit_occupation_prediction ViTForImageClassification from darshanz
author: John Snow Labs
name: image_classifier_vit_occupation_prediction
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_occupation_prediction` is an English model originally trained by darshanz.
## Predicted Entities
`anchor`, `professor`, `doctor`, `farmer`, `athlete`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_occupation_prediction_en_4.1.0_3.0_1660168721579.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_occupation_prediction_en_4.1.0_3.0_1660168721579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_occupation_prediction", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_occupation_prediction", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_occupation_prediction|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from comacrae)
author: John Snow Labs
name: roberta_qa_unaugv3
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-unaugv3` is an English model originally trained by `comacrae`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_unaugv3_en_4.3.0_3.0_1674222699716.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_unaugv3_en_4.3.0_3.0_1674222699716.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_unaugv3","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_unaugv3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_unaugv3|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/comacrae/roberta-unaugv3
---
layout: model
title: English BertForQuestionAnswering model (from xraychen)
author: John Snow Labs
name: bert_qa_squad_baseline
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad-baseline` is an English model originally trained by `xraychen`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_squad_baseline_en_4.0.0_3.0_1654191934724.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_squad_baseline_en_4.0.0_3.0_1654191934724.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad_baseline","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_squad_baseline","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base.by_xraychen").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_squad_baseline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/xraychen/squad-baseline
---
layout: model
title: English asr_wav2vec2_base_rj_try_5 TFWav2Vec2ForCTC from rjrohit
author: John Snow Labs
name: asr_wav2vec2_base_rj_try_5
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_rj_try_5` is an English model originally trained by rjrohit.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_rj_try_5_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_rj_try_5_en_4.2.0_3.0_1664102610582.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_rj_try_5_en_4.2.0_3.0_1664102610582.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_rj_try_5", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_rj_try_5", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_rj_try_5|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|
---
layout: model
title: ICD10CM Musculoskeletal Entity Resolver
author: John Snow Labs
name: chunkresolve_icd10cm_musculoskeletal_clinical
class: ChunkEntityResolverModel
language: en
nav_key: models
repository: clinical/models
date: 2020-04-28
task: Entity Resolution
edition: Healthcare NLP 2.4.5
spark_version: 2.4
tags: [clinical,licensed,entity_resolution,en]
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Entity resolution model that maps clinical chunks to ICD-10-CM codes via k-nearest neighbors over word embeddings, using Word Mover's Distance.
## Predicted Entities
ICD10-CM Codes and their normalized definition with `clinical_embeddings`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange.button-orange-trans.arr.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICD10_CM.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_musculoskeletal_clinical_en_2.4.5_2.4_1588103998999.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_musculoskeletal_clinical_en_2.4.5_2.4_1588103998999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPython.html %}
```python
...
muscu_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_musculoskeletal_clinical","en","clinical/models")\
.setInputCols("token","chunk_embeddings")\
.setOutputCol("entity")
pipeline_puerile = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, muscu_resolver])
data = spark.createDataFrame([["""The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion."""]]).toDF("text")
model = pipeline_puerile.fit(data)
results = model.transform(data)
```
```scala
...
val muscu_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_musculoskeletal_clinical","en","clinical/models")
.setInputCols(Array("token","chunk_embeddings"))
.setOutputCol("resolution")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, muscu_resolver))
val data = Seq("The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
```bash
chunk entity icd10_muscu_description icd10_muscu_code
0 a cold, cough PROBLEM Postprocedural hemorrhage of a musculoskeletal... M96831
1 runny nose PROBLEM Acquired deformity of nose M950
2 fever PROBLEM Periodic fever syndromes M041
3 difficulty breathing PROBLEM Other dentofacial functional abnormalities M2659
4 her cough PROBLEM Cervicalgia M542
5 physical exam TEST Pathological fracture, unspecified toe(s), seq... M84479S
6 fairly congested PROBLEM Synovial hypertrophy, not elsewhere classified... M67262
7 Amoxil TREATMENT Torticollis M436
8 Aldex TREATMENT Other soft tissue disorders related to use, ov... M7088
9 difficulty breathing PROBLEM Other dentofacial functional abnormalities M2659
10 more congested PROBLEM Pain in unspecified ankle and joints of unspec... M25579
11 trouble sleeping PROBLEM Low back pain M545
12 congestion PROBLEM Progressive systemic sclerosis M340
```
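The resolver returns ICD-10-CM codes in their undotted storage form (e.g. `M96831`); the conventional display form places a dot after the 3-character category (`M96.831`). A minimal helper for that conversion (not part of Spark NLP, shown only for illustration):

```python
def icd10cm_display(code: str) -> str:
    """Insert the conventional dot after the 3-character category.

    ICD-10-CM codes consist of a 3-character category, optionally
    followed by up to 4 more characters; the dot is display-only.
    """
    return code if len(code) <= 3 else f"{code[:3]}.{code[3:]}"

print(icd10cm_display("M96831"))   # M96.831
print(icd10cm_display("M84479S"))  # M84.479S
```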
{:.model-param}
## Model Information
{:.table-model}
|----------------|-----------------------------------------------|
| Name: | chunkresolve_icd10cm_musculoskeletal_clinical |
| Type: | ChunkEntityResolverModel |
| Compatibility: | Spark NLP 2.4.5+ |
| License: | Licensed |
| Edition: | Official |
|Input labels: | [token, chunk_embeddings] |
|Output labels: | [entity] |
| Language: | en |
| Case sensitive: | True |
| Dependencies: | embeddings_clinical |
{:.h2_title}
## Data Source
Trained on ICD10CM Dataset Range: M0000-M9979XXS
[https://www.icd10data.com/ICD10CM/Codes/M00-M99](https://www.icd10data.com/ICD10CM/Codes/M00-M99)
---
layout: model
title: Fast Neural Machine Translation Model from French-Based Creoles And Pidgins to English
author: John Snow Labs
name: opus_mt_cpf_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, cpf, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `cpf`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_cpf_en_xx_2.7.0_2.4_1609168557473.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_cpf_en_xx_2.7.0_2.4_1609168557473.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_cpf_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_cpf_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.cpf.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_cpf_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForQuestionAnswering (from huxxx657)
author: John Snow Labs
name: roberta_qa_huxxx657_roberta_base_finetuned_squad
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad` is an English model originally trained by `huxxx657`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_huxxx657_roberta_base_finetuned_squad_en_4.0.0_3.0_1655734309712.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_huxxx657_roberta_base_finetuned_squad_en_4.0.0_3.0_1655734309712.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_huxxx657_roberta_base_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_huxxx657_roberta_base_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base.by_huxxx657").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
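As the example above shows, nlu's question-answering `predict` takes the question and its context joined by the `|||` separator. A tiny helper (hypothetical, for illustration only) to build that input:

```python
def qa_input(question: str, context: str) -> str:
    """Join a question and its context with the '|||' separator
    expected by nlu question-answering pipelines."""
    return f"{question}|||{context}"

print(qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
# What's my name?|||My name is Clara and I live in Berkeley.
```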
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_huxxx657_roberta_base_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/huxxx657/roberta-base-finetuned-squad
---
layout: model
title: Extract Clinical Problem Entities (low granularity) from Voice of the Patient Documents (embeddings_clinical_medium)
author: John Snow Labs
name: ner_vop_problem_reduced_emb_clinical_medium
date: 2023-06-07
tags: [licensed, clinical, ner, en, vop, problem]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts mentions of clinical problems from documents written in the patient's own words. The taxonomy is reduced: a single label covers all clinical problems.
## Predicted Entities
`Problem`, `HealthStatus`, `Modifier`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_reduced_emb_clinical_medium_en_4.4.3_3.0_1686148297394.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_reduced_emb_clinical_medium_en_4.4.3_3.0_1686148297394.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_vop_problem_reduced_emb_clinical_medium", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_vop_problem_reduced_emb_clinical_medium", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| chunk | ner_label |
|:---------------------|:------------|
| pain | Problem |
| fatigue | Problem |
| rheumatoid arthritis | Problem |
```
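Downstream, the extracted chunk/label pairs are often summarized, for example as entity counts per label. A pure-Python sketch, assuming the chunks have already been collected from `result` as `(text, label)` tuples (the Spark collection step is omitted; values are taken from the table above):

```python
from collections import Counter

# (chunk, label) pairs as they might be collected from `result`
chunks = [
    ("pain", "Problem"),
    ("fatigue", "Problem"),
    ("rheumatoid arthritis", "Problem"),
]

# Count how many chunks were tagged with each NER label
label_counts = Counter(label for _, label in chunks)
print(label_counts)  # Counter({'Problem': 3})
```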
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_problem_reduced_emb_clinical_medium|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.8 MB|
|Dependencies:|embeddings_clinical_medium|
## References
In-house annotated health-related text in colloquial language.
## Sample text from the training dataset
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Problem 6190 1000 1020 7210 0.86 0.86 0.86
HealthStatus 92 32 15 107 0.74 0.86 0.80
Modifier 819 221 320 1139 0.79 0.72 0.75
macro_avg 7101 1253 1355 8456 0.80 0.81 0.80
micro_avg 7101 1253 1355 8456 0.85 0.84 0.84
```
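The precision, recall, and F1 values above follow the standard definitions from true-positive, false-positive, and false-negative counts. A minimal sketch that reproduces the micro-average row of the table:

```python
def prf(tp: int, fp: int, fn: int) -> tuple:
    """Standard precision/recall/F1 from tp/fp/fn counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# micro-average row from the benchmark above: tp=7101, fp=1253, fn=1355
p, r, f = prf(7101, 1253, 1355)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.85 0.84 0.84
```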
---
layout: model
title: Pipeline to Detect Living Species (roberta_embeddings_BR_BERTo)
author: John Snow Labs
name: ner_living_species_roberta_pipeline
date: 2023-03-13
tags: [pt, ner, clinical, licensed, roberta]
task: Named Entity Recognition
language: pt
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_living_species_roberta](https://nlp.johnsnowlabs.com/2022/06/22/ner_living_species_roberta_pt_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_roberta_pipeline_pt_4.3.0_3.2_1678732150750.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_roberta_pipeline_pt_4.3.0_3.2_1678732150750.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_living_species_roberta_pipeline", "pt", "clinical/models")
text = '''Mulher de 23 anos, de Capinota, Cochabamba, Bolívia. Ela está no nosso país há quatro anos. Frequentou o departamento de emergência obstétrica onde foi encontrada grávida de 37 semanas, com um colo dilatado de 5 cm e membranas rompidas. O obstetra de emergência realizou um teste de estreptococos negativo e solicitou um hemograma, glucose, bioquímica básica, HBV, HCV e serologia da sífilis.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_living_species_roberta_pipeline", "pt", "clinical/models")
val text = "Mulher de 23 anos, de Capinota, Cochabamba, Bolívia. Ela está no nosso país há quatro anos. Frequentou o departamento de emergência obstétrica onde foi encontrada grávida de 37 semanas, com um colo dilatado de 5 cm e membranas rompidas. O obstetra de emergência realizou um teste de estreptococos negativo e solicitou um hemograma, glucose, bioquímica básica, HBV, HCV e serologia da sífilis."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:--------------|--------:|------:|:------------|-------------:|
| 0 | Mulher | 0 | 5 | HUMAN | 0.9975 |
| 1 | país | 71 | 74 | HUMAN | 0.8869 |
| 2 | grávida | 163 | 169 | HUMAN | 0.9702 |
| 3 | estreptococos | 283 | 295 | SPECIES | 0.9211 |
| 4 | HBV | 360 | 362 | SPECIES | 0.9911 |
| 5 | HCV | 365 | 367 | SPECIES | 0.9858 |
| 6 | sífilis | 384 | 390 | SPECIES | 0.8898 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_living_species_roberta_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|pt|
|Size:|654.0 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- RoBertaEmbeddings
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Detect Clinical Entities (ner_jsl)
author: John Snow Labs
name: ner_jsl
date: 2021-06-24
tags: [ner, licensed, en, clinical]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.1.0
spark_version: 2.4
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained named entity recognition deep learning model for clinical terminology. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This model is the official version of the jsl_ner_wip_clinical model.
Definitions of Predicted Entities:
- `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else.
- `Direction`: All the information relating to the laterality of the internal and external organs.
- `Test`: Mentions of laboratory, pathology, and radiological tests.
- `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient.
- `Death_Entity`: Mentions that indicate the death of a patient.
- `Relationship_Status`: State of the patient's romantic or social relationships (e.g. single, married, divorced).
- `Duration`: The duration of a medical treatment or medication use.
- `Respiration`: Number of breaths per minute.
- `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms.
- `Birth_Entity`: Mentions that indicate giving birth.
- `Age`: All mentions of ages, past or present, related to the patient or to anybody else.
- `Labour_Delivery`: Extractions include stages of labor and delivery.
- `Family_History_Header`: Identifies section headers that correspond to the Family History of the patient.
- `BMI`: Numeric values and other text information related to Body Mass Index.
- `Temperature`: All mentions that refer to body temperature.
- `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else.
- `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic").
- `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else.
- `Medical_History_Header`: Identifies section headers that correspond to Past Medical History of a patient.
- `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events.
- `Oxygen_Therapy`: Breathing support triggered by patient or entirely or partially by machine (e.g. ventilator, BPAP, CPAP).
- `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements.
- `Psychological_Condition`: All mental health diagnoses, disorders, conditions or syndromes of a patient or someone else.
- `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases.
- `Employment`: All mentions of patient or provider occupational titles and employment status.
- `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels).
- `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.).
- `Pregnancy`: All terms related to Pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.).
- `ImagingFindings`: All mentions of radiographic and imagistic findings.
- `Procedure`: All mentions of invasive medical or surgical procedures or treatments.
- `Medical_Device`: All mentions related to medical devices and supplies.
- `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups.
- `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels).
- `Symptom`: All the symptoms mentioned in the document, of a patient or someone else.
- `Treatment`: Includes therapeutic and minimally invasive treatment and procedures (invasive treatments or procedures are extracted as "Procedure").
- `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs).
- `Route`: Drug and medication administration routes available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Drug_Ingredient`: Active ingredient/s found in drug products.
- `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted.
- `Diet`: All mentions and information regarding the patient's dietary habits.
- `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye.
- `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein).
- `VS_Finding`: Qualitative data (e.g. Fever, Cyanosis, Tachycardia) and any other symptoms that refers to vital signs.
- `Allergen`: Allergen related extractions mentioned in the document.
- `EKG_Findings`: All mentions of EKG readings.
- `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology.
- `Triglycerides`: All terms related to the specific lab test for triglycerides.
- `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning").
- `Gender`: Gender-specific nouns and pronouns.
- `Pulse`: Peripheral heart rate, without advanced information like measurement location.
- `Social_History_Header`: Identifies section headers that correspond to Social History of a patient.
- `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs).
- `Diabetes`: All terms related to diabetes mellitus.
- `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in ICD-10 name of a specific disease, the respective modifier is not extracted separately.
- `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined by the naked eye.
- `Clinical_Dept`: Terms that indicate the medical and/or surgical departments.
- `Form`: Drug and medication forms available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients.
- `Strength`: Potency of one unit of drug (or a combination of drugs); the measurement units available are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Fetus_NewBorn`: All terms related to fetus, infant, new born (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.).
- `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago").
- `Height`: All mentions related to a patient's height.
- `Test_Result`: Terms related to all the test results present in the document (clinical tests results are included).
- `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity.
- `Frequency`: Frequency of administration for a dose prescribed.
- `Time`: Specific time references (hour and/or minutes).
- `Weight`: All mentions related to a patient's weight.
- `Vaccine`: Generic and brand name of vaccines or vaccination procedure.
- `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient.
- `Communicable_Disease`: Includes all mentions of communicable diseases.
- `Dosage`: Quantity prescribed by the physician for an active ingredient; the measurement units available are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately).
- `Hypertension`: All terms related to Hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure).
- `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein).
- `Total_Cholesterol`: Terms related to the lab test and results for cholesterol.
- `Smoking`: All mentions of smoking status of a patient.
- `Date`: Mentions of an exact date, in any format, including day number, month and/or year.
## Predicted Entities
`Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Respiration`, `Hyperlipidemia`, `Birth_Entity`, `Age`, `Labour_Delivery`, `Family_History_Header`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Symptom`, `Treatment`, `Substance`, `Route`, `Drug_Ingredient`, `Blood_Pressure`, `Diet`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Drug_BrandName`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Sexually_Active_or_Sexual_Orientation`, `Frequency`, `Time`, `Weight`, `Vaccine`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Overweight`, `Hypertension`, `HDL`, `Total_Cholesterol`, `Smoking`, `Date`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_3.1.0_2.4_1624566960534.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_3.1.0_2.4_1624566960534.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("jsl_ner")
jsl_ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "jsl_ner"]) \
.setOutputCol("ner_chunk")
jsl_ner_pipeline = Pipeline().setStages([
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
jsl_ner,
jsl_ner_converter])
jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
data = spark.createDataFrame([["""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."""]]).toDF("text")
result = jsl_ner_model.transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("jsl_ner")
val jsl_ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "jsl_ner"))
.setOutputCol("ner_chunk")
val jsl_ner_pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
jsl_ner,
jsl_ner_converter))
val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text")
val result = jsl_ner_pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.jsl").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_sw_cased","sw") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_sw_cased","sw")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_sw_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|sw|
|Size:|370.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/bert-base-sw-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Legal Whereas Clause Binary Classifier
author: John Snow Labs
name: legclf_cuad_whereas_clause
date: 2022-09-20
tags: [en, legal, classification, clauses, whereas, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `whereas` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
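As a rough illustration of the splitting advice above, here is a minimal sketch (plain Python, not part of Spark NLP) that chunks a long text into pieces of at most 512 tokens. The 512 limit comes from the model's embeddings; the naive whitespace tokenization is an assumption for illustration only, since real subword tokenizers count tokens differently.

```python
def split_into_chunks(text, max_tokens=512):
    """Naively split `text` into chunks of at most `max_tokens`
    whitespace-separated tokens (illustrative only; subword
    tokenizers such as WordPiece produce more tokens per word)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

chunks = split_into_chunks("word " * 1200)
print([len(c.split()) for c in chunks])  # [512, 512, 176]
```

Each chunk can then be fed to the classifier independently, and the per-chunk predictions aggregated however suits your use case.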
## Predicted Entities
`other`, `whereas`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_whereas_clause_en_1.0.0_3.2_1663693211440.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_whereas_clause_en_1.0.0_3.2_1663693211440.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+---------+
|   result|
+---------+
|[whereas]|
|  [other]|
|  [other]|
|[whereas]|
+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_cuad_whereas_clause|
|Type:|legal|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.4 MB|
## References
In-house annotations on CUAD dataset
## Benchmarking
```bash
label         precision  recall  f1-score  support
other              0.98    0.94      0.96       67
whereas            0.91    0.98      0.94       41
accuracy              -       -      0.95      108
macro-avg          0.95    0.96      0.95      108
weighted-avg       0.96    0.95      0.95      108
```
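The macro and weighted averages in the benchmark above can be checked directly from the per-class F1 scores and supports; a quick sanity check in plain Python (the scores and class counts 67 and 41 are taken from the table):

```python
# Per-class F1 and support from the benchmark table above.
f1 = {"other": 0.96, "whereas": 0.94}
support = {"other": 67, "whereas": 41}
total = sum(support.values())  # 108

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1.values()) / len(f1)
# Weighted average: mean weighted by class support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.95 0.95
```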
---
layout: model
title: English RobertaForQuestionAnswering (from saattrupdan)
author: John Snow Labs
name: roberta_qa_icebert_texas_squad_is_saattrupdan
date: 2022-06-21
tags: [is, open_source, question_answering, roberta]
task: Question Answering
language: is
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `icebert-texas-squad-is` is an Icelandic model originally trained by `saattrupdan`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_icebert_texas_squad_is_saattrupdan_is_4.0.0_3.0_1655789280018.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_icebert_texas_squad_is_saattrupdan_is_4.0.0_3.0_1655789280018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_icebert_texas_squad_is_saattrupdan","is") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_icebert_texas_squad_is_saattrupdan","is")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("is.answer_question.squad.roberta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
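In the NLU one-liner above, the question and context are passed as a single string separated by `|||`. A minimal helper illustrating this input convention (the separator is the only detail taken from the snippet above; the helper names are ours):

```python
SEP = "|||"

def make_qa_input(question, context):
    """Join a question and its context with the `|||` separator
    used by NLU question-answering one-liners."""
    return f"{question}{SEP}{context}"

def parse_qa_input(text):
    """Recover (question, context) from a joined string."""
    question, _, context = text.partition(SEP)
    return question, context

joined = make_qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
print(parse_qa_input(joined))  # ("What's my name?", "My name is Clara and I live in Berkeley.")
```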
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_icebert_texas_squad_is_saattrupdan|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|is|
|Size:|455.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/saattrupdan/icebert-texas-squad-is
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_fpdm_ft_news
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_roberta_FT_newsqa` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_ft_news_en_4.3.0_3.0_1674211000201.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_ft_news_en_4.3.0_3.0_1674211000201.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_ft_news","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_ft_news","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_fpdm_ft_news|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|458.6 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AnonymousSub/fpdm_roberta_FT_newsqa
---
layout: model
title: Hindi Named Entity Recognition (from l3cube-pune)
author: John Snow Labs
name: bert_ner_hing_bert_lid
date: 2022-05-09
tags: [bert, ner, token_classification, hi, open_source]
task: Named Entity Recognition
language: hi
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `hing-bert-lid` is a Hindi model originally trained by `l3cube-pune`.
## Predicted Entities
`EN`, `HI`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_hing_bert_lid_hi_3.4.2_3.0_1652097677950.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_hing_bert_lid_hi_3.4.2_3.0_1652097677950.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_hing_bert_lid","hi") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["मुझे स्पार्क एनएलपी बहुत पसंद है"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_hing_bert_lid","hi")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("मुझे स्पार्क एनएलपी बहुत पसंद है").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_hing_bert_lid|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|hi|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/l3cube-pune/hing-bert-lid
- https://github.com/l3cube-pune/code-mixed-nlp
- https://arxiv.org/abs/2204.08398
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_4
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_4_en_4.3.0_3.0_1674213450529.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_4_en_4.3.0_3.0_1674213450529.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_4","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_4","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_4|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|439.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-4
---
layout: model
title: Fast Neural Machine Translation Model from Central Bikol to Swedish
author: John Snow Labs
name: opus_mt_bcl_sv
date: 2021-06-01
tags: [open_source, seq2seq, translation, bcl, sv, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
source languages: bcl
target languages: sv
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_sv_xx_3.1.0_2.4_1622552497758.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_sv_xx_3.1.0_2.4_1622552497758.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_bcl_sv", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_bcl_sv", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Central Bikol.translate_to.Swedish').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_bcl_sv|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English ElectraForQuestionAnswering Large model (from mrm8488)
author: John Snow Labs
name: electra_qa_large_finetuned_squadv1
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-large-finetuned-squadv1` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_large_finetuned_squadv1_en_4.0.0_3.0_1655920852902.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_large_finetuned_squadv1_en_4.0.0_3.0_1655920852902.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_large_finetuned_squadv1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_large_finetuned_squadv1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.electra.large.by_mrm8488").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_large_finetuned_squadv1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/mrm8488/electra-large-finetuned-squadv1
---
layout: model
title: Multilingual BertForQuestionAnswering model (from horsbug98)
author: John Snow Labs
name: bert_qa_Part_1_mBERT_Model_E1
date: 2022-06-02
tags: [en, ar, bn, fi, id, ja, sw, ko, ru, te, th, open_source, question_answering, bert, xx]
task: Question Answering
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Part_1_mBERT_Model_E1` is a Multilingual model originally trained by `horsbug98`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Part_1_mBERT_Model_E1_xx_4.0.0_3.0_1654178889694.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Part_1_mBERT_Model_E1_xx_4.0.0_3.0_1654178889694.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Part_1_mBERT_Model_E1","xx") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_Part_1_mBERT_Model_E1","xx")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("xx.answer_question.tydiqa.multi_lingual_bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_Part_1_mBERT_Model_E1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|xx|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/horsbug98/Part_1_mBERT_Model_E1
---
layout: model
title: BERT Sentence Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on SST-2
author: John Snow Labs
name: sent_bert_wiki_books_sst2
date: 2021-08-31
tags: [en, open_source, sentence_embeddings, wikipedia_dataset, books_corpus_dataset, sst_2_dataset]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.2.0
spark_version: 3.0
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/wiki_books/1 and fine-tuned on SST-2. This is a BERT base architecture but some changes have been made to the original training and export scheme based on more recent learnings.
This model is intended to be used for a variety of English NLP tasks. The pre-training data contains more formal text and the model may not generalize to more colloquial text such as social media or messages.
This model is fine-tuned on the SST-2 and is recommended for use in sentiment analysis tasks. The fine-tuning task uses the Stanford Sentiment Treebank (SST-2) dataset to predict the sentiment in a given sentence.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_sst2_en_3.2.0_3.0_1630412133457.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_sst2_en_3.2.0_3.0_1630412133457.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en").setInputCols(["document"]).setOutputCol("sentence")
sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books_sst2", "en") \
.setInputCols("sentence") \
.setOutputCol("bert_sentence")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings])
```
```scala
val document_assembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en").setInputCols("document").setOutputCol("sentence")
val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books_sst2", "en")
.setInputCols("sentence")
.setOutputCol("bert_sentence")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings))
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
sent_embeddings_df = nlu.load('en.embed_sentence.bert.wiki_books_sst2').predict(text, output_level='sentence')
sent_embeddings_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_bert_wiki_books_sst2|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[bert_sentence]|
|Language:|en|
|Case sensitive:|false|
## Data Source
[1]: [Wikipedia dataset](https://dumps.wikimedia.org/)
[2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb)
[3]: [Stanford Sentiment Treebank (SST-2) dataset](https://nlp.stanford.edu/sentiment/index.html)
This Model has been imported from: https://tfhub.dev/google/experts/bert/wiki_books/sst2/2
---
layout: model
title: English DistilBertForSequenceClassification Base Uncased model (from mrm8488)
author: John Snow Labs
name: distilbert_classifier_base_uncased_newspop_student
date: 2022-07-20
tags: [open_source, distilbert, sequence_classifier, classification, newspop, en]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBERT Classification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-newspop-student` is an English model originally trained by `mrm8488`.
## Predicted Entities
`palestine`, `obama`, `microsoft`, `economy`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_classifier_base_uncased_newspop_student_en_4.0.0_3.0_1658326819970.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_classifier_base_uncased_newspop_student_en_4.0.0_3.0_1658326819970.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
seq = DistilBertForSequenceClassification.pretrained("distilbert_classifier_base_uncased_newspop_student","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, seq])
data = spark.createDataFrame([["PUT YOUR STRING HERE."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val seq = DistilBertForSequenceClassification.pretrained("distilbert_classifier_base_uncased_newspop_student","en")
.setInputCols(Array("document", "token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, seq))
val data = Seq("PUT YOUR STRING HERE.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_classifier_base_uncased_newspop_student|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|249.8 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
https://huggingface.co/mrm8488/distilbert-base-uncased-newspop-student
---
layout: model
title: English BertForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: bert_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_hier_triplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657191205969.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657191205969.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/rule_based_hier_triplet_epochs_1_shard_1_squad2.0
---
layout: model
title: Stopwords Remover for Kyrgyz language (96 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, ky, open_source]
task: Stop Words Removal
language: ky
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_ky_3.4.1_3.0_1646673146794.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_ky_3.4.1_3.0_1646673146794.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","ky") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Сен менден артык эмессиң"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","ky")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("Сен менден артык эмессиң").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ky.stopwords").predict("""Сен менден артык эмессиң""")
```
## Results
```bash
+-----------------------------+
|result |
+-----------------------------+
|[Сен, менден, артык, эмессиң]|
+-----------------------------+
```
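Under the hood, stop-word removal is just a token filter against a fixed lookup set, which is why every token in the example above survives. A minimal pure-Python sketch of the idea, using a small illustrative stop-word set (hypothetical here, not the model's actual 96-entry list):

```python
# Toy stop-word filter: keeps only tokens absent from the stop-word set.
# The stop-word set below is illustrative, NOT the model's actual entries.
STOP_WORDS = {"жана", "менен", "бул"}  # hypothetical subset

def clean_tokens(tokens):
    """Return tokens with stop words removed (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["Сен", "менден", "артык", "эмессиң"]
print(clean_tokens(tokens))  # none of these are in the toy set, so all pass
```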
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|ky|
|Size:|1.8 KB|
---
layout: model
title: Swahili XLMRobertaForTokenClassification Base Cased model (from mbeukman)
author: John Snow Labs
name: xlmroberta_ner_base_finetuned_luganda_finetuned_ner_swahili
date: 2022-08-01
tags: [sw, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: sw
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili` is a Swahili model originally trained by `mbeukman`.
## Predicted Entities
`PER`, `DATE`, `ORG`, `LOC`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_luganda_finetuned_ner_swahili_sw_4.1.0_3.0_1659354352086.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_luganda_finetuned_ner_swahili_sw_4.1.0_3.0_1659354352086.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_luganda_finetuned_ner_swahili","sw") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_luganda_finetuned_ner_swahili","sw")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
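The NerConverter stage in the pipeline above groups token-level IOB tags (e.g. `B-PER`, `I-PER`) into entity chunks. A simplified pure-Python sketch of that grouping logic (illustrative only, not the Spark NLP implementation):

```python
def iob_to_chunks(tokens, tags):
    """Collapse IOB-tagged tokens into (entity_type, text) chunks."""
    chunks, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # a B- tag closes any open chunk and starts a new one
                chunks.append((ctype, " ".join(current)))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(tok)  # continuation of the open chunk
        else:  # "O" or an inconsistent I- tag closes any open chunk
            if current:
                chunks.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        chunks.append((ctype, " ".join(current)))
    return chunks

print(iob_to_chunks(
    ["Jina", "langu", "ni", "Amina", "Hassan"],
    ["O", "O", "O", "B-PER", "I-PER"]))  # → [('PER', 'Amina Hassan')]
```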
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_finetuned_luganda_finetuned_ner_swahili|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|sw|
|Size:|1.0 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://github.com/masakhane-io/masakhane-ner
---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_0
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-32-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_0_en_4.0.0_3.0_1655732398416.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_0_en_4.0.0_3.0_1655732398416.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_32d_seed_0").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
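Extractive QA models like this one score every context token as a candidate answer start and end; the predicted answer is the span maximizing start score plus end score, with start ≤ end. A toy sketch of that decoding step (the scores below are made up for illustration, not real model output):

```python
def best_span(tokens, start_scores, end_scores, max_len=15):
    """Pick the span maximizing start_scores[i] + end_scores[j], i <= j."""
    best, best_score = (0, 0), float("-inf")
    for i in range(len(tokens)):
        for j in range(i, min(i + max_len, len(tokens))):
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    i, j = best
    return " ".join(tokens[i:j + 1])

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.1, 0.0, 0.0, 0.0, 0.3, 0.0]  # made-up scores
end   = [0.0, 0.1, 0.0, 4.8, 0.2, 0.0, 0.0, 0.0, 0.5, 0.1]
print(best_span(tokens, start, end))  # → Clara
```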
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|417.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-32-finetuned-squad-seed-0
---
layout: model
title: Chinese BertForMaskedLM Cased model (from hfl)
author: John Snow Labs
name: bert_embeddings_rbtl3
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rbtl3` is a Chinese model originally trained by `hfl`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbtl3_zh_4.2.4_3.0_1670327142675.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbtl3_zh_4.2.4_3.0_1670327142675.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbtl3","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbtl3","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_rbtl3|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|228.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/hfl/rbtl3
- https://arxiv.org/abs/1906.08101
- https://github.com/google-research/bert
- https://github.com/ymcui/Chinese-BERT-wwm
- https://github.com/ymcui/MacBERT
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/ymcui/HFL-Anthology
- https://arxiv.org/abs/2004.13922
---
layout: model
title: English T5ForConditionalGeneration Cased model (from gokceuludogan)
author: John Snow Labs
name: t5_t2t_adex_prompt
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t2t-adeX-prompt` is an English model originally trained by `gokceuludogan`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_t2t_adex_prompt_en_4.3.0_3.0_1675107607187.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_t2t_adex_prompt_en_4.3.0_3.0_1675107607187.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_t2t_adex_prompt","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_t2t_adex_prompt","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_t2t_adex_prompt|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|925.3 MB|
## References
- https://huggingface.co/gokceuludogan/t2t-adeX-prompt
- https://github.com/gokceuludogan/boun-tabi-smm4h22
---
layout: model
title: Detect PHI for Deidentification purposes (Spanish, augmented)
author: John Snow Labs
name: ner_deid_subentity_augmented
date: 2022-02-16
tags: [deid, es, licensed]
task: De-identification
language: es
edition: Healthcare NLP 3.3.4
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs".
Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 17 entities, which is more than the previously released `ner_deid_subentity` model.
This NER model is trained with a combination of custom datasets, Spanish 2002 conLL, MeddoProf and MeddoCan datasets, and includes several data augmentation mechanisms.
## Predicted Entities
`PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `ID`, `STREET`, `USERNAME`, `SEX`, `EMAIL`, `ZIP`, `MEDICALRECORD`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_es_3.3.4_3.0_1645006642756.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_es_3.3.4_3.0_1645006642756.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")\
.setInputCols(["sentence","token"])\
.setOutputCol("word_embeddings")
clinical_ner = medical.NerModel.pretrained("ner_deid_subentity_augmented", "es", "clinical/models")\
.setInputCols(["sentence","token","word_embeddings"])\
.setOutputCol("ner")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner])
text = ['''
Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.
''']
data = spark.createDataFrame([text]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented", "es", "clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner))
val text = "Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos."
val data = Seq(text).toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.med_ner.deid.subentity_augmented").predict("""
Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.
""")
```
## Results
```bash
+----------------------+
|result                |
+----------------------+
|[separation-agreement]|
|[other]               |
|[other]               |
|[separation-agreement]|
+----------------------+
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_separation_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house + SEC documents
## Benchmarking
```bash
label precision recall f1-score support
other 0.93 0.95 0.94 82
separation-agreement 0.88 0.83 0.85 35
accuracy - - 0.91 117
macro-avg 0.90 0.89 0.90 117
weighted-avg 0.91 0.91 0.91 117
```
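The macro and weighted averages in the table follow directly from the per-class rows: macro averaging weights both classes equally, while weighted averaging weights each class by its support. A quick sanity check of the reported F1 averages:

```python
# Per-class F1 and support copied from the benchmarking table above.
f1 = {"other": 0.94, "separation-agreement": 0.85}
support = {"other": 82, "separation-agreement": 35}

# Macro: plain mean over classes; weighted: support-weighted mean.
macro_f1 = sum(f1.values()) / len(f1)
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(round(macro_f1, 2), round(weighted_f1, 2))
```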
---
layout: model
title: English asr_wav2vec2_base_100h_ngram TFWav2Vec2ForCTC from saahith
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_100h_ngram
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_ngram` is an English model originally trained by saahith.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_100h_ngram_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_ngram_en_4.2.0_3.0_1664042368247.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_ngram_en_4.2.0_3.0_1664042368247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_100h_ngram', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_100h_ngram", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_100h_ngram|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|227.9 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from google)
author: John Snow Labs
name: t5_efficient_base_dl6
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-dl6` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dl6_en_4.3.0_3.0_1675110178520.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dl6_en_4.3.0_3.0_1675110178520.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_base_dl6","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_base_dl6","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_base_dl6|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|366.3 MB|
## References
- https://huggingface.co/google/t5-efficient-base-dl6
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from stevhliu)
author: John Snow Labs
name: distilbert_qa_my_awesome_model
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `my_awesome_qa_model` is an English model originally trained by `stevhliu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_my_awesome_model_en_4.3.0_3.0_1672775319390.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_my_awesome_model_en_4.3.0_3.0_1672775319390.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_my_awesome_model","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_my_awesome_model","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_my_awesome_model|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/stevhliu/my_awesome_qa_model
---
layout: model
title: Clinical Deidentification (Spanish)
author: John Snow Labs
name: clinical_deidentification
date: 2023-06-13
tags: [deid, es, licensed]
task: De-identification
language: es
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline is trained with sciwiki_300d embeddings and can be used to deidentify PHI information from medical texts in Spanish. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask, fake, or obfuscate the following entities: `AGE`, `DATE`, `PROFESSION`, `E-MAIL`, `USERNAME`, `LOCATION`, `DOCTOR`, `HOSPITAL`, `PATIENT`, `URL`, `IP`, `MEDICALRECORD`, `IDNUM`, `ORGANIZATION`, `PHONE`, `ZIP`, `ACCOUNT`, `SSN`, `PLATE`, `SEX`, and `IPADDR`.
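Conceptually, masking replaces each detected PHI span with its entity label, while obfuscation substitutes a realistic fake value. A simplified pure-Python sketch of the masking step (the spans below are toy, hand-annotated offsets, not output of this pipeline):

```python
def mask_phi(text, spans):
    """Replace (start, end, label) character spans with <LABEL> placeholders.

    Spans are processed right-to-left so earlier offsets stay valid.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text

sample = "Nombre: Jose. Edad: 37 anos."
spans = [(8, 12, "PATIENT"), (20, 22, "AGE")]  # toy offsets, hand-annotated
print(mask_phi(sample, spans))  # → Nombre: <PATIENT>. Edad: <AGE> anos.
```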
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_es_4.4.4_3.2_1686663754234.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_es_4.4.4_3.2_1686663754234.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from johnsnowlabs import *
deid_pipeline = PretrainedPipeline("clinical_deidentification", "es", "clinical/models")
sample = """Datos del paciente.
Nombre: Jose .
Apellidos: Aranda Martinez.
NHC: 2748903.
NASS: 26 37482910 04.
Domicilio: Calle Losada Martí 23, 5 B..
Localidad/ Provincia: Madrid.
CP: 28016.
Datos asistenciales.
Fecha de nacimiento: 15/04/1977.
País: España.
Edad: 37 años Sexo: F.
Fecha de Ingreso: 05/06/2018.
Médico: María Merino Viveros NºCol: 28 28 35489.
Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra. María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com
"""
result = deid_pipeline.annotate(sample)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "es", "clinical/models")
val sample = """Datos del paciente.
Nombre: Jose .
Apellidos: Aranda Martinez.
NHC: 2748903.
NASS: 26 37482910 04.
Domicilio: Calle Losada Martí 23, 5 B..
Localidad/ Provincia: Madrid.
CP: 28016.
Datos asistenciales.
Fecha de nacimiento: 15/04/1977.
País: España.
Edad: 37 años Sexo: F.
Fecha de Ingreso: 05/06/2018.
Médico: María Merino Viveros NºCol: 28 28 35489.
Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra. María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com
"""
val result = deid_pipeline.annotate(sample)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.deid.clinical").predict("""Datos del paciente.
Nombre: Jose .
Apellidos: Aranda Martinez.
NHC: 2748903.
NASS: 26 37482910 04.
Domicilio: Calle Losada Martí 23, 5 B..
Localidad/ Provincia: Madrid.
CP: 28016.
Datos asistenciales.
Fecha de nacimiento: 15/04/1977.
País: España.
Edad: 37 años Sexo: F.
Fecha de Ingreso: 05/06/2018.
Médico: María Merino Viveros NºCol: 28 28 35489.
Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra. María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com
""")
```
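The pipeline produces one output per de-identification policy. As a minimal sketch (assuming, based on the sample outputs in the Results section, that `annotate()` returns a dict keyed by policy name; the helper and key names here are illustrative, so check `result.keys()` on your own run), selecting a single policy could look like:

```python
# Sketch: pick one de-identification policy from an annotate() result.
# The key names ("masked", "obfuscated", ...) are assumptions inferred from
# the sample outputs shown below, not a documented API contract.
def select_policy(annotations: dict, policy: str = "obfuscated") -> str:
    """Join the sentence-level outputs for a single policy into one string."""
    if policy not in annotations:
        raise KeyError(f"unknown policy {policy!r}; available: {sorted(annotations)}")
    return " ".join(annotations[policy])

# Mocked annotate() result, for illustration only:
mock_result = {
    "masked": ["Nombre: <PATIENT> .", "Apellidos: <PATIENT>."],
    "obfuscated": ["Nombre: Sr. Lerma .", "Apellidos: Aristides Gonzalez Gelabert."],
}
print(select_policy(mock_result, "masked"))
```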
## Results
```bash
Results
Masked with entity labels
------------------------------
Datos del paciente.
Nombre: .
Apellidos: .
NHC: .
NASS: 04.
Domicilio: , 5 B..
Localidad/ Provincia: .
CP: .
Datos asistenciales.
Fecha de nacimiento: .
País: .
Edad: años Sexo: .
Fecha de Ingreso: .
: María Merino Viveros NºCol: .
Informe clínico del paciente: de años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias.
Antes de comenzar el cuadro estuvo en en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado.
Entre los comensales aparecieron varios casos de brucelosis.
Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación.
En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos.
Auscultación cardíaca rítmica, sin soplos, roces ni extratonos.
Auscultación pulmonar con conservación del murmullo vesicular.
Abdomen blando, depresible, sin masas ni megalias.
En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad.
Extremidades sin varices ni edemas.
Pulsos periféricos presentes y simétricos.
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3.
VSG: 40 mm 1ª hora.
Coagulación: TQ 87%;
TTPA 25,8 seg.
Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl.
Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: +++;
Test de Coombs > 1/1280; Brucellacapt > 1/5120.
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas).
El paciente significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra.
Servicio de Endocrinología y Nutrición km 12,500 28905 - () Correo electrónico:
Masked with chars
------------------------------
Datos del paciente.
Nombre: [**] .
Apellidos: [*************].
NHC: [*****].
NASS: ** [******] 04.
Domicilio: [*******************], 5 B..
Localidad/ Provincia: [****].
CP: [***].
Datos asistenciales.
Fecha de nacimiento: [********].
País: [****].
Edad: ** años Sexo: *.
Fecha de Ingreso: [********].
[****]: María Merino Viveros NºCol: ** ** [***].
Informe clínico del paciente: [***] de ** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias.
Antes de comenzar el cuadro estuvo en [*********] en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado.
Entre los comensales aparecieron varios casos de brucelosis.
Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación.
En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos.
Auscultación cardíaca rítmica, sin soplos, roces ni extratonos.
Auscultación pulmonar con conservación del murmullo vesicular.
Abdomen blando, depresible, sin masas ni megalias.
En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad.
Extremidades sin varices ni edemas.
Pulsos periféricos presentes y simétricos.
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3.
VSG: 40 mm 1ª hora.
Coagulación: TQ 87%;
TTPA 25,8 seg.
Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl.
Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: [*************] +++;
Test de Coombs > 1/1280; Brucellacapt > 1/5120.
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas).
El paciente [****] significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra.
[******************] [******************************] Servicio de Endocrinología y Nutrición [*****************] km 12,500 28905 [****] - [****] ([****]) Correo electrónico: [********************]
Masked with fixed length chars
------------------------------
Datos del paciente.
Nombre: **** .
Apellidos: ****.
NHC: ****.
NASS: **** **** 04.
Domicilio: ****, 5 B..
Localidad/ Provincia: ****.
CP: ****.
Datos asistenciales.
Fecha de nacimiento: ****.
País: ****.
Edad: **** años Sexo: ****.
Fecha de Ingreso: ****.
****: María Merino Viveros NºCol: **** **** ****.
Informe clínico del paciente: **** de **** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias.
Antes de comenzar el cuadro estuvo en **** en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado.
Entre los comensales aparecieron varios casos de brucelosis.
Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación.
En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos.
Auscultación cardíaca rítmica, sin soplos, roces ni extratonos.
Auscultación pulmonar con conservación del murmullo vesicular.
Abdomen blando, depresible, sin masas ni megalias.
En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad.
Extremidades sin varices ni edemas.
Pulsos periféricos presentes y simétricos.
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3.
VSG: 40 mm 1ª hora.
Coagulación: TQ 87%;
TTPA 25,8 seg.
Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl.
Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: **** +++;
Test de Coombs > 1/1280; Brucellacapt > 1/5120.
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas).
El paciente **** significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra.
**** **** Servicio de Endocrinología y Nutrición **** km 12,500 28905 **** - **** (****) Correo electrónico: ****
Obfuscated
------------------------------
Datos del paciente.
Nombre: Sr. Lerma .
Apellidos: Aristides Gonzalez Gelabert.
NHC: BBBBBBBBQR648597.
NASS: 041010000011 RZRM020101906017 04.
Domicilio: Valencia, 5 B..
Localidad/ Provincia: Madrid.
CP: 99335.
Datos asistenciales.
Fecha de nacimiento: 25/04/1977.
País: Barcelona.
Edad: 8 años Sexo: F..
Fecha de Ingreso: 02/08/2018.
transportista: María Merino Viveros NºCol: olegario10 olegario10 felisa78.
Informe clínico del paciente: RZRM020101906017 de 8 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias.
Antes de comenzar el cuadro estuvo en Madrid en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado.
Entre los comensales aparecieron varios casos de brucelosis.
Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha.
La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación.
En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos.
Auscultación cardíaca rítmica, sin soplos, roces ni extratonos.
Auscultación pulmonar con conservación del murmullo vesicular.
Abdomen blando, depresible, sin masas ni megalias.
En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad.
Extremidades sin varices ni edemas.
Pulsos periféricos presentes y simétricos.
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva.
Los datos analíticos muestran los siguientes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/mm3.
VSG: 40 mm 1ª hora.
Coagulación: TQ 87%;
TTPA 25,8 seg.
Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl.
Orina: sedimento normal.
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Dra. Laguna +++;
Test de Coombs > 1/1280; Brucellacapt > 1/5120.
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos.
Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas).
El paciente 041010000011 significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro.
Remitido por: Dra.
Reinaldo Manjón Malo Barcelona Servicio de Endocrinología y Nutrición Valencia km 12,500 28905 Bilbao - Madrid (Barcelona) Correo electrónico: quintanasalome@example.net
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|clinical_deidentification|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|es|
|Size:|281.3 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ChunkMergeModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- Finisher
---
layout: model
title: English image_classifier_vit_rust_image_classification_2 ViTForImageClassification from SummerChiam
author: John Snow Labs
name: image_classifier_vit_rust_image_classification_2
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rust_image_classification_2` is an English model originally trained by SummerChiam.
## Predicted Entities
`nonrust`, `rust`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_2_en_4.1.0_3.0_1660166968473.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_2_en_4.1.0_3.0_1660166968473.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_rust_image_classification_2", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_rust_image_classification_2", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_rust_image_classification_2|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Legal Employee benefit plans Clause Binary Classifier (md)
author: John Snow Labs
name: legclf_employee_benefit_plans_md
date: 2022-11-25
tags: [en, legal, classification, document, agreement, contract, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `employee-benefit-plans` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only individual sentences instead of the whole text, so it is better to skip it unless you want to do binary classification at sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial linked above).
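As a rough illustration of that limit, here is a small library-agnostic sketch that splits text into pieces of at most 512 whitespace-separated words; this is a crude stand-in for the model's actual wordpiece count, so treat the boundary as approximate:

```python
def chunk_by_tokens(text, max_tokens=512):
    """Split text into pieces of at most max_tokens whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

# 1100 words -> 3 pieces (512 + 512 + 76)
pieces = chunk_by_tokens("clause " * 1100)
print([len(p.split()) for p in pieces])
```

Each resulting piece can then be sent through the classifier pipeline separately.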
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the clause models you have added.
## Predicted Entities
`other`, `employee-benefit-plans`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_employee_benefit_plans_md_en_1.0.0_3.0_1669376476318.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_employee_benefit_plans_md_en_1.0.0_3.0_1669376476318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
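A minimal usage sketch for this classifier, patterned on the sibling `legclf_*` cards in Models Hub; the sentence-embeddings model name (`sent_bert_base_cased`) and the column names are assumptions rather than values taken from this card (running it requires a licensed Legal NLP installation):

```python
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_employee_benefit_plans_md", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
model = nlpPipeline.fit(df)
result = model.transform(df)
```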
## Results
```bash
+------------------------+
|                  result|
+------------------------+
|[employee-benefit-plans]|
|                 [other]|
|                 [other]|
|[employee-benefit-plans]|
+------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_employee_benefit_plans_md|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents.
## Benchmarking
```bash
                        precision  recall  f1-score  support
employee-benefit-plans       0.97    1.00      0.98       28
                 other       1.00    0.97      0.99       39
              accuracy                         0.99       67
             macro avg       0.98    0.99      0.98       67
          weighted avg       0.99    0.99      0.99       67
```
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from huxxx657)
author: John Snow Labs
name: roberta_qa_base_finetuned_scrambled_squad_5
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-5` is an English model originally trained by `huxxx657`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_5_en_4.3.0_3.0_1674216944002.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_5_en_4.3.0_3.0_1674216944002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_5","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_5","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_finetuned_scrambled_squad_5|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-5
---
layout: model
title: NER Model Finder with Sentence Entity Resolvers (sbert_jsl_medium_uncased)
author: John Snow Labs
name: sbertresolve_ner_model_finder
date: 2021-11-24
tags: [ner, licensed, en, clinical, entity_resolver]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.2
spark_version: 2.4
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities (NER labels) to the most appropriate NER model using `sbert_jsl_medium_uncased` Sentence Bert Embeddings. Given the entity name, it will return a list of pretrained NER models having that entity or similar ones.
## Predicted Entities
`NER Model Names`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_ner_model_finder_en_3.3.2_2.4_1637764208798.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_ner_model_finder_en_3.3.2_2.4_1637764208798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbert_jsl_medium_uncased","en","clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("sbert_embeddings")
ner_model_finder = SentenceEntityResolverModel\
.pretrained("sbertresolve_ner_model_finder", "en", "clinical/models")\
.setInputCols(["ner_chunk", "sbert_embeddings"])\
.setOutputCol("model_names")\
.setDistanceFunction("EUCLIDEAN")
ner_model_finder_pipelineModel = PipelineModel(stages = [documentAssembler, sbert_embedder, ner_model_finder])
light_pipeline = LightPipeline(ner_model_finder_pipelineModel)
annotations = light_pipeline.fullAnnotate("medication")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbert_jsl_medium_uncased","en","clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("sbert_embeddings")
val ner_model_finder = SentenceEntityResolverModel
.pretrained("sbertresolve_ner_model_finder", "en", "clinical/models")
.setInputCols(Array("ner_chunk", "sbert_embeddings"))
.setOutputCol("model_names")
.setDistanceFunction("EUCLIDEAN")
val ner_model_finder_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, ner_model_finder))
val ner_model_finder_pipelineModel = ner_model_finder_pipeline.fit(Seq("").toDF("text"))
val light_pipeline = new LightPipeline(ner_model_finder_pipelineModel)
val annotations = light_pipeline.fullAnnotate("medication")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.ner.model_finder").predict("""Put your text here.""")
```
## Results
```bash
entity      : medication
models      : ['ner_posology', 'ner_posology_large', 'ner_posology_small', 'ner_posology_greedy', 'ner_drugs_large', 'ner_posology_experimental', 'ner_drugs_greedy', 'ner_ade_clinical', 'ner_jsl_slim', 'ner_posology_healthcare', 'ner_ade_healthcare', 'jsl_ner_wip_modifier_clinical', 'ner_ade_clinical', 'ner_jsl_greedy', 'ner_risk_factors']
all_models  : ['ner_posology', 'ner_posology_large', 'ner_posology_small', 'ner_posology_greedy', 'ner_drugs_large', 'ner_posology_experimental', 'ner_drugs_greedy', 'ner_ade_clinical', 'ner_jsl_slim', 'ner_posology_healthcare', 'ner_ade_healthcare', 'jsl_ner_wip_modifier_clinical', 'ner_ade_clinical', 'ner_jsl_greedy', 'ner_risk_factors']:::['ner_posology', 'ner_posology_large', 'ner_posology_small', 'ner_posology_greedy', 'ner_drugs_large', 'ner_posology_experimental', 'ner_drugs_greedy', 'ner_jsl_slim', 'ner_posology_healthcare', 'ner_ade_healthcare', 'jsl_ner_wip_modifier_clinical', 'ner_ade_clinical', 'ner_jsl_greedy', 'ner_risk_factors']:::['jsl_rd_ner_wip_greedy_clinical', 'ner_clinical_large', 'ner_healthcare', 'ner_jsl_enriched', 'ner_clinical', 'ner_jsl_slim', 'ner_covid_trials', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_events_admission_clinical', 'ner_events_healthcare', 'ner_events_clinical', 'ner_jsl_greedy']:::['ner_medmentions_coarse']:::['ner_jsl_enriched', 'ner_covid_trials', 'ner_jsl', 'ner_medmentions_coarse']:::['ner_drugs']:::['ner_clinical_icdem', 'ner_medmentions_coarse']:::['jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_medmentions_coarse', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy']:::['jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_medmentions_coarse', 'ner_radiology_wip_clinical', 'ner_jsl_slim', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy', 'ner_radiology']:::['ner_medmentions_coarse', 'ner_clinical_icdem']:::['ner_posology_experimental']:::['jsl_rd_ner_wip_greedy_clinical', 'ner_measurements_clinical', 'ner_radiology_wip_clinical', 'ner_radiology']:::['jsl_rd_ner_wip_greedy_clinical', 'ner_posology_small', 'ner_jsl_enriched', 'ner_posology_experimental', 'ner_posology_large', 'ner_posology_healthcare', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_posology_greedy', 'ner_posology', 'ner_jsl_greedy']:::['ner_covid_trials', 'ner_medmentions_coarse', 'jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy']:::['ner_deid_subentity_augmented', 'ner_deid_subentity_glove', 'ner_deidentify_dl', 'ner_deid_enriched']:::['jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_covid_trials', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy']:::['ner_medmentions_coarse', 'jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy']:::['ner_chemd_clinical']
resolutions : medication:::drug:::treatment:::therapeutic procedure:::drug ingredient:::drug chemical:::diagnostic aid:::substance:::medical device:::diagnostic procedure:::administration:::measurement:::drug strength:::physiological reaction:::patient:::vaccine:::psychological condition:::abbreviation
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbertresolve_ner_model_finder|
|Compatibility:|Healthcare NLP 3.3.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sbert_embeddings]|
|Output Labels:|[models]|
|Language:|en|
|Case sensitive:|false|
---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_dl6
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-dl6` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_dl6_en_4.3.0_3.0_1675123262536.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_dl6_en_4.3.0_3.0_1675123262536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_tiny_dl6","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_tiny_dl6","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_dl6|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|62.3 MB|
## References
- https://huggingface.co/google/t5-efficient-tiny-dl6
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Financial NER (Headers / Subheaders)
author: John Snow Labs
name: finner_headers
date: 2022-08-29
tags: [en, finance, ner, headers, splitting, sections, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a Named Entity Recognition model, which will help you split long financial documents into smaller sections. To do that, it detects Headers and Subheaders of different sections. You can then use the beginning and end information in the metadata to retrieve the text between those headers.
This model has been trained on 10-K filings, with the following HEADER and SUBHEADER annotation guidelines:
- PART I, PART II, etc. are HEADERS
- Item 1, Item 2, etc. are also HEADERS
- Item 1A, 2B, etc. are SUBHEADERS
- 1., 2., 2.1, etc. are SUBHEADERS
- Other kinds of short section names are also SUBHEADERS
For more information about long document splitting, see [this](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb) workshop entry.
## Predicted Entities
`HEADER`, `SUBHEADER`
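Once the pipeline has run, the `begin` offsets stored with each detected header chunk are what let you cut the document into sections. Below is a minimal sketch in plain Python, independent of Spark NLP; the `chunk` and `begin` field names mirror the annotation metadata but are assumptions for this sketch.

```python
def split_by_headers(text, header_chunks):
    """Return (header, section_text) pairs using header begin offsets."""
    # Sort headers by their position in the document.
    chunks = sorted(header_chunks, key=lambda c: c["begin"])
    sections = []
    for i, chunk in enumerate(chunks):
        start = chunk["begin"]
        # A section runs until the next header starts (or the text ends).
        end = chunks[i + 1]["begin"] if i + 1 < len(chunks) else len(text)
        sections.append((chunk["chunk"], text[start:end].strip()))
    return sections

doc = "Item 1. Business. We sell software. Item 1A. Risk Factors. Markets vary."
headers = [
    {"chunk": "Item 1.", "begin": 0},
    {"chunk": "Item 1A.", "begin": 36},
]
for name, body in split_by_headers(doc, headers):
    print(name, "->", body)
```

The same slicing idea applies to the real pipeline output: collect the `ner_chunk` annotations labeled HEADER/SUBHEADER, read their begin offsets from the metadata, and slice the original text between consecutive offsets.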
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_HEADERS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_headers_en_1.0.0_3.2_1661771922923.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_headers_en_1.0.0_3.2_1661771922923.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained('finner_headers', 'en', 'finance/models')\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""
2. Definitions. For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1. 2. Appointment as Reseller.
2.1 Appointment. The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6
2.2 Customer Agreements."""]
res = model.transform(spark.createDataFrame([text]).toDF("text"))
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_est_qa","et") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlm_roberta_est_qa","et")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("et.answer_question.xlm_roberta.by_anukaver").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
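In the nlu one-liner above, the question and context travel in a single string separated by `|||`. A tiny helper (hypothetical, not part of the nlu API) makes the format explicit and avoids quoting mistakes:

```python
def to_nlu_qa_input(question: str, context: str) -> str:
    # nlu question-answering models read "question|||context" from one string.
    return f"{question}|||{context}"

print(to_nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
```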
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_est_qa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|et|
|Size:|883.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anukaver/xlm-roberta-est-qa
---
layout: model
title: English BertForQuestionAnswering model (from madlag)
author: John Snow Labs
name: bert_qa_bert_base_uncased_squadv1_x2.44_f87.7_d26_hybrid_filled_v1
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squadv1-x2.44-f87.7-d26-hybrid-filled-v1` is an English model originally trained by `madlag`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1_x2.44_f87.7_d26_hybrid_filled_v1_en_4.0.0_3.0_1654181609802.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1_x2.44_f87.7_d26_hybrid_filled_v1_en_4.0.0_3.0_1654181609802.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squadv1_x2.44_f87.7_d26_hybrid_filled_v1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_uncased_squadv1_x2.44_f87.7_d26_hybrid_filled_v1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base_uncased_x2.32_f86.6_d15_hybrid_v1.by_madlag").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_uncased_squadv1_x2.44_f87.7_d26_hybrid_filled_v1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|174.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/madlag/bert-base-uncased-squadv1-x2.44-f87.7-d26-hybrid-filled-v1
- https://rajpurkar.github.io/SQuAD-explorer
- https://www.aclweb.org/anthology/N19-1423.pdf
---
layout: model
title: Fast Neural Machine Translation Model from English to Japanese
author: John Snow Labs
name: opus_mt_en_jap
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, jap, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `jap`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_jap_xx_2.7.0_2.4_1609168310758.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_jap_xx_2.7.0_2.4_1609168310758.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_jap", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_jap", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.jap').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_jap|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RoBERTa Embeddings (from jackaduma)
author: John Snow Labs
name: roberta_embeddings_SecRoBERTa
date: 2022-04-14
tags: [roberta, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `SecRoBERTa` is an English model originally trained by `jackaduma`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_SecRoBERTa_en_3.4.2_3.0_1649946774316.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_SecRoBERTa_en_3.4.2_3.0_1649946774316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_SecRoBERTa","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_SecRoBERTa","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.SecRoBERTa").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_SecRoBERTa|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|314.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/jackaduma/SecRoBERTa
- https://github.com/jackaduma/SecBERT/
- https://github.com/kbandla/APTnotes
- https://stucco.github.io/data/
- https://ebiquity.umbc.edu/_file_directory_/papers/943.pdf
- https://competitions.codalab.org/competitions/17262
- https://github.com/allenai/scibert
---
layout: model
title: Social Determinants of Health (clinical_large)
author: John Snow Labs
name: ner_sdoh_emb_clinical_large_wip
date: 2023-04-17
tags: [en, clinical_large, social_determinants, public_health, ner, sdoh, pyspark_30, licensed]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts terminology related to Social Determinants of Health from various kinds of biomedical documents.
## Predicted Entities
`Access_To_Care`, `Age`, `Alcohol`, `Chidhood_Event`, `Communicable_Disease`, `Community_Safety`, `Diet`, `Disability`, `Eating_Disorder`, `Education`, `Employment`, `Environmental_Condition`, `Exercise`, `Family_Member`, `Financial_Status`, `Food_Insecurity`, `Gender`, `Geographic_Entity`, `Healthcare_Institution`, `Housing`, `Hyperlipidemia`, `Hypertension`, `Income`, `Insurance_Status`, `Language`, `Legal_Issues`, `Marital_Status`, `Mental_Health`, `Obesity`, `Other_Disease`, `Other_SDoH_Keywords`, `Population_Group`, `Quality_Of_Life`, `Race_Ethnicity`, `Sexual_Activity`, `Sexual_Orientation`, `Smoking`, `Social_Exclusion`, `Social_Support`, `Spiritual_Beliefs`, `Substance_Duration`, `Substance_Frequency`, `Substance_Quantity`, `Substance_Use`, `Transportation`, `Violence_Or_Abuse`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_NER/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_emb_clinical_large_wip_en_4.3.2_3.0_1681756284245.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_emb_clinical_large_wip_en_4.3.2_3.0_1681756284245.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_sdoh_emb_clinical_large_wip", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
clinical_embeddings,
ner_model,
ner_converter
])
sample_texts = [["Smith is a 55 years old, divorced Mexcian American woman with financial problems. She speaks spanish. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and does not have access to health insurance or paid sick leave. She has a son student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reprots having her catholic faith as a means of support as well. She has long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. She had DUI back in April and was due to be in court this week."]]
data = spark.createDataFrame(sample_texts).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_sdoh_emb_clinical_large_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
clinical_embeddings,
ner_model,
ner_converter
))
val data = Seq("Smith is a 55 years old, divorced Mexcian American woman with financial problems. She speaks spanish. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and does not have access to health insurance or paid sick leave. She has a son student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reprots having her catholic faith as a means of support as well. She has long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. She had DUI back in April and was due to be in court this week.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_sdoh_emb_clinical_large_wip|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.0 MB|
|Dependencies:|embeddings_clinical_large|
## References
Internal SHOP Project
## Benchmarking
```bash
label precision recall f1-score support
Employment 0.94 0.96 0.95 2075
Social_Support 0.91 0.90 0.90 658
Other_SDoH_Keywords 0.82 0.87 0.85 259
Healthcare_Institution 0.99 0.95 0.97 781
Alcohol 0.96 0.97 0.96 258
Gender 0.99 0.99 0.99 4957
Other_Disease 0.89 0.94 0.91 583
Access_To_Care 0.86 0.88 0.87 520
Mental_Health 0.89 0.81 0.85 494
Age 0.92 0.96 0.94 433
Marital_Status 1.00 1.00 1.00 92
Substance_Quantity 0.88 0.86 0.87 58
Substance_Use 0.91 0.97 0.94 192
Family_Member 0.97 0.99 0.98 2094
Financial_Status 0.86 0.65 0.74 124
Race_Ethnicity 0.93 0.93 0.93 27
Insurance_Status 0.93 0.87 0.90 85
Spiritual_Beliefs 0.86 0.81 0.83 52
Housing 0.88 0.85 0.87 400
Geographic_Entity 0.86 0.88 0.87 113
Disability 0.93 0.93 0.93 44
Quality_Of_Life 0.89 0.75 0.81 67
Income 0.89 0.77 0.83 31
Education 0.85 0.88 0.86 58
Transportation 0.86 0.89 0.88 57
Legal_Issues 0.72 0.91 0.80 47
Smoking 0.98 0.97 0.98 66
Substance_Frequency 0.93 0.75 0.83 57
Hypertension 1.00 1.00 1.00 21
Violence_Or_Abuse 0.83 0.62 0.71 63
Exercise 0.96 0.88 0.92 57
Diet 0.95 0.87 0.91 70
Sexual_Orientation 0.68 1.00 0.81 13
Language 0.89 0.73 0.80 22
Social_Exclusion 0.96 0.90 0.93 29
Substance_Duration 0.75 0.85 0.80 39
Communicable_Disease 1.00 0.84 0.91 31
Chidhood_Event 0.88 0.61 0.72 23
Community_Safety 0.95 0.93 0.94 44
Population_Group 0.89 0.62 0.73 13
Hyperlipidemia 0.78 1.00 0.88 7
Food_Insecurity 1.00 0.93 0.96 29
Eating_Disorder 0.67 0.92 0.77 13
Sexual_Activity 0.84 0.90 0.87 29
Environmental_Condition 1.00 1.00 1.00 20
Obesity 1.00 1.00 1.00 12
micro-avg 0.95 0.95 0.95 15217
macro-avg 0.90 0.88 0.88 15217
weighted-avg 0.95 0.95 0.95 15217
```
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from Nadav)
author: John Snow Labs
name: roberta_qa_base_squad_finetuned_on_runaways
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad-finetuned-on-runaways-en` is an English model originally trained by `Nadav`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad_finetuned_on_runaways_en_4.3.0_3.0_1674218728000.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad_finetuned_on_runaways_en_4.3.0_3.0_1674218728000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad_finetuned_on_runaways","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad_finetuned_on_runaways","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_squad_finetuned_on_runaways|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|467.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Nadav/roberta-base-squad-finetuned-on-runaways-en
---
layout: model
title: Chinese T5ForConditionalGeneration Cased model (from IDEA-CCNL)
author: John Snow Labs
name: t5_randeng_77m_multitask_chinese
date: 2023-01-30
tags: [zh, open_source, t5, tensorflow]
task: Text Generation
language: zh
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Randeng-T5-77M-MultiTask-Chinese` is a Chinese model originally trained by `IDEA-CCNL`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_randeng_77m_multitask_chinese_zh_4.3.0_3.0_1675098367899.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_randeng_77m_multitask_chinese_zh_4.3.0_3.0_1675098367899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_randeng_77m_multitask_chinese","zh") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_randeng_77m_multitask_chinese","zh")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_randeng_77m_multitask_chinese|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|zh|
|Size:|349.2 MB|
## References
- https://huggingface.co/IDEA-CCNL/Randeng-T5-77M-MultiTask-Chinese
- https://github.com/IDEA-CCNL/Fengshenbang-LM
- https://fengshenbang-doc.readthedocs.io/
- http://jmlr.org/papers/v21/20-074.html
- https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/main/fengshen/examples/pretrain_t5
- https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/main/fengshen/examples/mt5_summary
- https://arxiv.org/abs/2209.02970
---
layout: model
title: English BertForQuestionAnswering Cased model (from clementgyj)
author: John Snow Labs
name: bert_qa_finetuned_squad_50k
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-50k` is an English model originally trained by `clementgyj`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_finetuned_squad_50k_en_4.0.0_3.0_1657186899398.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_finetuned_squad_50k_en_4.0.0_3.0_1657186899398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_finetuned_squad_50k","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_finetuned_squad_50k","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_finetuned_squad_50k|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/clementgyj/bert-finetuned-squad-50k
---
layout: model
title: French CamemBert Embeddings (from fjluque)
author: John Snow Labs
name: camembert_embeddings_fjluque_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `fjluque`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_fjluque_generic_model_fr_3.4.4_3.0_1653988580244.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_fjluque_generic_model_fr_3.4.4_3.0_1653988580244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_fjluque_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_fjluque_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_fjluque_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/fjluque/dummy-model
---
layout: model
title: English asr_Urdu_repo TFWav2Vec2ForCTC from bilalahmed15
author: John Snow Labs
name: asr_Urdu_repo
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Urdu_repo` is an English model originally trained by bilalahmed15.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_Urdu_repo_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Urdu_repo_en_4.2.0_3.0_1664107184096.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Urdu_repo_en_4.2.0_3.0_1664107184096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_Urdu_repo", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_Urdu_repo", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_Urdu_repo|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: Finnish BERT Sentence Embeddings (Base Cased)
author: John Snow Labs
name: sent_bert_finnish_cased
date: 2020-08-31
task: Embeddings
language: fi
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, fi]
supported: true
deprecated: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A version of Google's BERT deep transfer learning model for Finnish. The model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks. `FinBERT` features a custom 50,000 wordpiece vocabulary that has much better coverage of Finnish words.
`FinBERT` has been pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls. By contrast, Multilingual BERT was trained on Wikipedia texts, where the Finnish Wikipedia text is approximately 3% of the amount used to train `FinBERT`.
These features allow `FinBERT` to outperform not only Multilingual BERT but also all previously proposed models when fine-tuned for Finnish natural language processing tasks.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_finnish_cased_fi_2.6.0_2.4_1598897560014.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_finnish_cased_fi_2.6.0_2.4_1598897560014.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_finnish_cased", "fi") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["Vihaan syöpää"], ["antibiootit eivät ole kipulääkkeitä"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_finnish_cased", "fi")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("Vihaan syöpää","antibiootit eivät ole kipulääkkeitä").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["Vihaan syöpää","antibiootit eivät ole kipulääkkeitä"]
embeddings_df = nlu.load('fi.embed_sentence.bert.cased').predict(text, output_level='sentence')
embeddings_df
```
{:.h2_title}
## Results
```bash
sentence fi_embed_sentence_bert_cased_embeddings
Vihaan syöpää [-0.32807931303977966, -0.18222537636756897, 0...
antibiootit eivät ole kipulääkkeitä [-0.192955881357193, -0.11151257902383804, 0.7...
```
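Sentence vectors like the ones above are typically compared with cosine similarity. A minimal sketch in plain Python; the 3-dimensional vectors below are toy stand-ins for the model's 768-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for the two sentence embeddings shown above.
v1 = [-0.328, -0.182, 0.5]
v2 = [-0.193, -0.112, 0.7]
print(cosine_similarity(v1, v2))
```

The same function applies unchanged to the full-length vectors returned in the `sentence_embeddings` column.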
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_bert_finnish_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|[fi]|
|Dimension:|768|
|Case sensitive:|true|
{:.h2_title}
## Data Source
The model is imported from https://github.com/TurkuNLP/FinBERT
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from SEISHIN)
author: John Snow Labs
name: distilbert_qa_seishin_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `SEISHIN`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_seishin_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769121878.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_seishin_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769121878.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_seishin_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_seishin_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_seishin_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/SEISHIN/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Translate English to Bislama Pipeline
author: John Snow Labs
name: translate_en_bi
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, bi, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial partners.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this module is computationally expensive, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `bi`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_bi_xx_2.7.0_2.4_1609698734459.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_bi_xx_2.7.0_2.4_1609698734459.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_bi", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_bi", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.bi').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_bi|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Extract Demographic Entities from Social Determinants of Health Texts
author: John Snow Labs
name: ner_sdoh_demographics_wip
date: 2023-02-10
tags: [licensed, clinical, social_determinants, en, ner, demographics, sdoh, public_health]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.2.8
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts demographic information related to Social Determinants of Health from various kinds of biomedical documents.
## Predicted Entities
`Family_Member`, `Age`, `Gender`, `Geographic_Entity`, `Race_Ethnicity`, `Language`, `Spiritual_Beliefs`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_NER/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_demographics_wip_en_4.2.8_3.0_1675998706136.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_demographics_wip_en_4.2.8_3.0_1675998706136.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_sdoh_demographics_wip", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
clinical_embeddings,
ner_model,
ner_converter
])
sample_texts = ["SOCIAL HISTORY: He is a former tailor from Korea.",
"He lives alone,single and no children.",
"Pt is a 61 years old married, Caucasian, Catholic woman. Pt speaks English reasonably well."]
data = spark.createDataFrame(sample_texts, StringType()).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_sdoh_demographics_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
clinical_embeddings,
ner_model,
ner_converter
))
val data = Seq("Pt is a 61 years old married, Caucasian, Catholic woman. Pt speaks English reasonably well.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+-----------------+-----+---+------------+
|ner_label |begin|end|chunk |
+-----------------+-----+---+------------+
|Gender |16 |17 |He |
|Geographic_Entity|43 |47 |Korea |
|Gender |0 |1 |He |
|Family_Member |29 |36 |children |
|Age |8 |19 |61 years old|
|Race_Ethnicity |30 |38 |Caucasian |
|Spiritual_Beliefs|41 |48 |Catholic |
|Gender |50 |54 |woman |
|Language |67 |73 |English |
+-----------------+-----+---+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_sdoh_demographics_wip|
|Compatibility:|Healthcare NLP 4.2.8+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|858.4 KB|
## Benchmarking
```bash
label tp fp fn total precision recall f1
Age 1346.0 73.0 74.0 1420.0 0.948555 0.947887 0.948221
Spiritual_Beliefs 100.0 13.0 16.0 116.0 0.884956 0.862069 0.873362
Family_Member 4468.0 134.0 43.0 4511.0 0.970882 0.990468 0.980577
Race_Ethnicity 56.0 0.0 13.0 69.0 1.000000 0.811594 0.896000
Gender 9825.0 67.0 247.0 10072.0 0.993227 0.975477 0.984272
Geographic_Entity 225.0 9.0 29.0 254.0 0.961538 0.885827 0.922131
Language 51.0 9.0 5.0 56.0 0.850000 0.910714 0.879310
```
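The precision, recall, and F1 columns above follow directly from the tp/fp/fn counts. As a sanity check, the Age row can be recomputed in plain Python (the counts are taken from the table):

```python
def prf1(tp, fp, fn):
    # precision = tp / (tp + fp); recall = tp / (tp + fn);
    # f1 is the harmonic mean of precision and recall
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Age row: tp=1346, fp=73, fn=74
p, r, f = prf1(1346, 73, 74)
print(round(p, 6), round(r, 6), round(f, 6))  # matches 0.948555 0.947887 0.948221
```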
---
layout: model
title: Explain Document ML Pipeline for English
author: John Snow Labs
name: explain_document_ml
date: 2022-06-24
tags: [open_source, english, explain_document_ml, pipeline, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The explain_document_ml is a pretrained pipeline that processes text with basic steps and recognizes entities. It performs most of the common text processing tasks on your DataFrame.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_ml_en_4.0.0_3.0_1656066222624.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_ml_en_4.0.0_3.0_1656066222624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('explain_document_ml', lang = 'en')
annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_ml", lang = "en")
val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs ! "]
result_df = nlu.load('en.explain').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | spell | lemmas | stems | pos |
|---:|:---------------------------------|:---------------------------------|:-------------------------------------------------|:------------------------------------------------|:------------------------------------------------|:-----------------------------------------------|:---------------------------------------|
|  0 | ['Hello fronm John Snwow Labs!'] | ['Hello fronm John Snwow Labs!'] | ['Hello', 'fronm', 'John', 'Snwow', 'Labs', '!'] | ['Hello', 'front', 'John', 'Snow', 'Labs', '!'] | ['Hello', 'front', 'John', 'Snow', 'Labs', '!'] | ['hello', 'front', 'john', 'snow', 'lab', '!'] | ['UH', 'NN', 'NNP', 'NNP', 'NNP', '.'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|explain_document_ml|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|9.6 MB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- NorvigSweetingModel
- LemmatizerModel
- Stemmer
- PerceptronModel
---
layout: model
title: Legal Tariff Policy Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_tariff_policy_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, tariff_policy, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
Given a document, the `legclf_tariff_policy_bert` model, a BERT Sentence Embeddings Document Classifier, classifies whether the document belongs to the class `Tariff_Policy` or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`Tariff_Policy`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_tariff_policy_bert_en_1.0.0_3.0_1678111753015.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_tariff_policy_bert_en_1.0.0_3.0_1678111753015.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
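This card omits a usage snippet. Going by the Input/Output labels in the table below (`sentence_embeddings` → `class`) and the pattern used by sibling legal classifier cards, a minimal sketch might look like the following. Note this is an assumption-laden sketch, not taken from this card: the sentence-embeddings model name (`sent_bert_base_cased`) and the `legal.ClassifierDLModel` accessor are guesses, and running it requires a licensed Spark NLP for Legal installation.

```python
# Hypothetical sketch; requires Spark NLP for Legal (licensed) and an active Spark session.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumed sentence-embeddings stage feeding the classifier's
# [sentence_embeddings] input column.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_tariff_policy_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

data = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("class.result").show(truncate=False)
```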
## Results
```bash
+---------------+
|result         |
+---------------+
|[Tariff_Policy]|
|[Other]        |
|[Other]        |
|[Tariff_Policy]|
+---------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_tariff_policy_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.7 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 0.82 0.85 0.83 969
Tariff_Policy 0.87 0.85 0.86 1175
accuracy - - 0.85 2144
macro-avg 0.85 0.85 0.85 2144
weighted-avg 0.85 0.85 0.85 2144
```
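The `macro-avg` and `weighted-avg` rows aggregate the per-class scores in two different ways: macro averaging treats every class equally, while weighted averaging weights each class by its support. A quick check against the precision column (plain Python; numbers from the table, which rounds to two decimals):

```python
classes = {
    # label: (precision, support)
    "Other": (0.82, 969),
    "Tariff_Policy": (0.87, 1175),
}

# Macro average: unweighted mean over classes.
macro_p = sum(p for p, _ in classes.values()) / len(classes)

# Weighted average: mean over classes weighted by support.
total = sum(n for _, n in classes.values())
weighted_p = sum(p * n for p, n in classes.values()) / total

print(macro_p, weighted_p)
```

Both land within rounding distance of the 0.85 reported in the table.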
---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_4
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-32-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_4_en_4.0.0_3.0_1657185004862.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_4_en_4.0.0_3.0_1657185004862.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_4","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_4","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_4|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-32-finetuned-squad-seed-4
---
layout: model
title: Detect Subentity PHI for Deidentification (Arabic)
author: John Snow Labs
name: ner_deid_subentity
date: 2023-05-31
tags: [licensed, clinical, ner, deidentification, arabic, ar]
task: Named Entity Recognition
language: ar
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Named Entity Recognition annotators allow a generic model to be trained using a deep learning architecture (char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Deidentification NER (Arabic) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 17 entities. This NER model is trained with a combination of custom datasets and several data augmentation mechanisms, and it uses Word2Vec Arabic embeddings.
## Predicted Entities
`PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `STREET`, `USERNAME`, `SEX`, `IDNUM`, `EMAIL`, `ZIP`, `MEDICALRECORD`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_ar_4.4.2_3.0_1685559675615.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_ar_4.4.2_3.0_1685559675615.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "ar", "clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner,
ner_converter])
text = '''
عالج الدكتور محمد المريض أحمد البالغ من العمر 55 سنة في 15/05/2000 في مستشفى مدينة الرباط. رقم هاتفه هو 0610948235 وبريده الإلكتروني
mohamedmell@gmail.com.
'''
data = spark.createDataFrame([[text]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")
.setInputCols(Array("sentence","token"))
.setOutputCol("word_embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "ar", "clinical/models")
.setInputCols(Array("sentence","token","word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner,
ner_converter))
val text = """
عالج الدكتور محمد المريض أحمد البالغ من العمر 55 سنة في 15/05/2000 في مستشفى مدينة الرباط. رقم هاتفه هو 0610948235 وبريده الإلكتروني
mohamedmell@gmail.com.
"""
val data = Seq(text).toDS.toDF("text")
val results = pipeline.fit(data).transform(data)
```
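Once the `ner_chunk` column is populated, a common downstream step in de-identification is to mask the detected entities. A minimal pure-Python sketch of that idea (the chunk texts and labels below are illustrative, not the model's actual output):

```python
# Hypothetical de-identification step: replace each detected chunk
# with its entity label. Chunks and labels are illustrative only.
text = "عالج الدكتور محمد المريض أحمد"
chunks = [("محمد", "DOCTOR"), ("أحمد", "PATIENT")]

masked = text
for chunk, label in chunks:
    masked = masked.replace(chunk, f"<{label}>")

print(masked)  # عالج الدكتور <DOCTOR> المريض <PATIENT>
```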
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","min") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","min")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("min.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|min|
|Size:|142.8 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: English RobertaForQuestionAnswering (from Teepika)
author: John Snow Labs
name: roberta_qa_roberta_base_squad2_finetuned_selqa
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-selqa` is an English model originally trained by `Teepika`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_finetuned_selqa_en_4.0.0_3.0_1655735329360.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_finetuned_selqa_en_4.0.0_3.0_1655735329360.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad2_finetuned_selqa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_squad2_finetuned_selqa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
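Under the hood, extractive QA models of this kind score every token as a possible answer start and end, then return the highest-scoring valid span. A toy pure-Python sketch of that selection step (the scores are made up for illustration, not the model's actual logits):

```python
# Toy extractive-QA span selection: the answer is the span (i, j),
# with i <= j, maximizing start_scores[i] + end_scores[j].
# Scores below are hypothetical, chosen so "Clara" wins.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.1, 5.0, 0.0, 0.1, 0.1, 0.1, 1.0, 0.0]
end_scores   = [0.1, 0.1, 0.2, 4.8, 0.1, 0.0, 0.1, 0.1, 1.2, 0.0]

best = max(
    ((i, j) for i in range(len(tokens)) for j in range(i, len(tokens))),
    key=lambda span: start_scores[span[0]] + end_scores[span[1]],
)
answer = " ".join(tokens[best[0] : best[1] + 1])
print(answer)  # Clara
```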
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.base.by_Teepika").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_squad2_finetuned_selqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Teepika/roberta-base-squad2-finetuned-selqa
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265899
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265899` is an English model originally trained by `teacookies`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265899_en_4.0.0_3.0_1655984561078.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265899_en_4.0.0_3.0_1655984561078.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265899","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265899","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265899").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265899|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|888.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265899
---
layout: model
title: Stopwords Remover for Lithuanian language (1314 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, lt, open_source]
task: Stop Words Removal
language: lt
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_lt_3.4.1_3.0_1646673057631.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_lt_3.4.1_3.0_1646673057631.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","lt") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Jūs nesate geresnis už mane"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","lt")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("Jūs nesate geresnis už mane").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("lt.stopwords").predict("""Jūs nesate geresnis už mane""")
```
## Results
```bash
+------------------+
|result |
+------------------+
|[nesate, geresnis]|
+------------------+
```
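Conceptually, the cleaner simply drops every token found in the stopword list, which is why only `nesate` and `geresnis` survive above. A pure-Python sketch using a hypothetical subset of the Lithuanian list:

```python
# Toy stopword removal; this three-word set is an illustrative subset,
# not the model's full 1314-entry Lithuanian list.
stopwords = {"jūs", "už", "mane"}
tokens = ["Jūs", "nesate", "geresnis", "už", "mane"]

# Matching is case-insensitive, mirroring a typical lowercase list.
clean_tokens = [t for t in tokens if t.lower() not in stopwords]
print(clean_tokens)  # ['nesate', 'geresnis']
```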
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|lt|
|Size:|4.8 KB|
---
layout: model
title: English RobertaForQuestionAnswering (from billfrench)
author: John Snow Labs
name: roberta_qa_cyberlandr_door
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cyberlandr-door` is an English model originally trained by `billfrench`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_cyberlandr_door_en_4.0.0_3.0_1655728103928.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_cyberlandr_door_en_4.0.0_3.0_1655728103928.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_cyberlandr_door","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_cyberlandr_door","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.roberta.by_billfrench").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_cyberlandr_door|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|413.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/billfrench/cyberlandr-door
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from arunkumar629)
author: John Snow Labs
name: distilbert_qa_arunkumar629_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `arunkumar629`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_arunkumar629_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769951769.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_arunkumar629_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769951769.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_arunkumar629_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_arunkumar629_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_arunkumar629_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/arunkumar629/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you TFWav2Vec2ForCTC from project2you
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you` is an English model originally trained by project2you.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you_en_4.2.0_3.0_1664110153624.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you_en_4.2.0_3.0_1664110153624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English BertForQuestionAnswering model (from Intel)
author: John Snow Labs
name: bert_qa_bert_large_uncased_squadv1.1_sparse_80_1x4_block_pruneofa
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-squadv1.1-sparse-80-1x4-block-pruneofa` is an English model originally trained by `Intel`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_squadv1.1_sparse_80_1x4_block_pruneofa_en_4.0.0_3.0_1654536766214.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_squadv1.1_sparse_80_1x4_block_pruneofa_en_4.0.0_3.0_1654536766214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_squadv1.1_sparse_80_1x4_block_pruneofa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_large_uncased_squadv1.1_sparse_80_1x4_block_pruneofa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.large_uncased_sparse_80_1x4_block_pruneofa.by_Intel").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_large_uncased_squadv1.1_sparse_80_1x4_block_pruneofa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|437.9 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Intel/bert-large-uncased-squadv1.1-sparse-80-1x4-block-pruneofa
- https://arxiv.org/abs/2111.05754
- https://github.com/IntelLabs/Model-Compression-Research-Package/tree/main/research/prune-once-for-all
---
layout: model
title: Detect Founding / Listing dates in texts (small)
author: John Snow Labs
name: finner_wiki_founding_dates
date: 2023-01-15
tags: [listing, founding, establishment, dates, en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is an NER model aimed at detecting the Establishment (Founding) and Listing dates of companies. It was trained on Wikipedia texts about companies.
## Predicted Entities
`FOUNDING_DATE`, `LISTING_DATE`, `O`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_wiki_founding_dates_en_1.0.0_3.0_1673798045941.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_wiki_founding_dates_en_1.0.0_3.0_1673798045941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
text = "The Toro Company, formerly known as the Toro Motor Company, is an American company founded in 1980. It was listed on the NASDAQ Global Market in August 2000. It design and operates lawn mowers and snow blowers and irrigation system supplies."
df = spark.createDataFrame([[text]]).toDF("text")
documenter = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencizer = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)
chunks = finance.NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
ner = finance.NerModel.pretrained("finner_wiki_founding_dates", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
pipe = nlp.Pipeline(stages=[documenter, sentencizer, tokenizer, embeddings, ner, chunks])
model = pipe.fit(df)
res = model.transform(df)
from pyspark.sql import functions as F

res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols")) \
.select(F.expr("cols['3']['sentence']").alias("sentence_id"),
F.expr("cols['0']").alias("chunk"),
F.expr("cols['2']").alias("end"),
F.expr("cols['3']['entity']").alias("ner_label"))\
.filter("ner_label!='O'")\
.show(truncate=False)
```
## Results
```bash
+-----------+-----------+---+-------------+
|sentence_id|chunk |end|ner_label |
+-----------+-----------+---+-------------+
|0 |1980 |97 |FOUNDING_DATE|
|1 |August 2000|155|LISTING_DATE |
+-----------+-----------+---+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finner_wiki_founding_dates|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|1.1 MB|
## References
Wikipedia
## Benchmarking
```bash
label tp fp fn prec rec f1
B-LISTING_DATE 10 0 4 1.0 0.71428573 0.8333334
B-FOUNDING_DATE 18 3 2 0.85714287 0.9 0.87804884
I-LISTING_DATE 8 0 1 1.0 0.8888889 0.94117653
Macro-average 36 4 9 4 0.9 0.8 0.8470588
Micro-average 36 4 9 4 0.9 0.8 0.8470588
```
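The per-label scores above follow the usual definitions: precision = tp / (tp + fp), recall = tp / (tp + fn), and F1 is their harmonic mean. Recomputing the `B-LISTING_DATE` row from its raw counts:

```python
# Recompute precision, recall, and F1 for the B-LISTING_DATE row.
tp, fp, fn = 10, 0, 4

prec = tp / (tp + fp)            # 10 / 10  = 1.0
rec = tp / (tp + fn)             # 10 / 14  ≈ 0.7142857
f1 = 2 * prec * rec / (prec + rec)  # harmonic mean ≈ 0.8333333

print(round(prec, 7), round(rec, 7), round(f1, 7))
```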
---
layout: model
title: Document Visual Question Answering with DONUT
author: John Snow Labs
name: docvqa_donut_base
date: 2023-01-17
tags: [en, licensed]
task: Document Visual Question Answering
language: en
nav_key: models
edition: Visual NLP 4.3.0
spark_version: 3.2.1
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Document understanding transformer (Donut) model pretrained for the Document Visual Question Answering (DocVQA) task. The dataset comes from the Document Visual Question Answering [competition](https://rrc.cvc.uab.es/?ch=17) and consists of 50K questions defined on more than 12K documents.
Donut is a new method of document understanding that utilizes an OCR-free end-to-end Transformer model. Donut does not require off-the-shelf OCR engines/APIs, yet it shows state-of-the-art performance on various visual document understanding tasks, such as visual document classification or information extraction (a.k.a. document parsing). The paper, [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664), was written by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han and Seunghyun Park.
DocVQA seeks to inspire a “purpose-driven” point of view in Document Analysis and Recognition research, where the document content is extracted and used to respond to high-level tasks defined by the human consumers of this information.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/ocr/VISUAL_QUESTION_ANSWERING/){:.button.button-orange.button-orange-trans.co.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrVisualQuestionAnswering.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/docvqa_donut_base_en_4.3.0_3.0_1673269990044.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/docvqa_donut_base_en_4.3.0_3.0_1673269990044.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
binary_to_image = BinaryToImage()\
.setInputCol("content") \
.setOutputCol("image") \
.setImageType(ImageType.TYPE_3BYTE_BGR)
visual_question_answering = VisualQuestionAnswering()\
.pretrained("docvqa_donut_base", "en", "clinical/ocr")\
.setInputCol("image")\
.setOutputCol("answers")\
.setQuestionsCol("questions")
# OCR pipeline
pipeline = PipelineModel(stages=[
binary_to_image,
visual_question_answering
])
test_image_path = pkg_resources.resource_filename('sparkocr', 'resources/ocr/vqa/agenda.png')
bin_df = spark.read.format("binaryFile").load(test_image_path)
questions = [["When it finish the Coffee Break?", "Who is giving the Introductory Remarks?", "Who is going to take part of the individual interviews?"]]
questions_df = spark.createDataFrame([questions])
questions_df = questions_df.withColumnRenamed("_1", "questions")
image_and_questions = bin_df.join(questions_df)
results = pipeline.transform(image_and_questions).cache()
results.select(results.answers).show(truncate=False)
```
```scala
val binary_to_image = new BinaryToImage()
.setInputCol("content")
.setOutputCol("image")
.setImageType(ImageType.TYPE_3BYTE_BGR)
val visual_question_answering = VisualQuestionAnswering()
.pretrained("docvqa_donut_base", "en", "clinical/ocr")
.setInputCol("image")
.setOutputCol("answers")
.setQuestionsCol("questions")
// OCR pipeline
val pipeline = new PipelineModel().setStages(Array(
binary_to_image,
visual_question_answering))
val test_image_path = "resources/ocr/vqa/agenda.png"
val bin_df = spark.read.format("binaryFile").load(test_image_path)
val questions = Array("When it finish the Coffee Break?", "Who is giving the Introductory Remarks?", "Who is going to take part of the individual interviews?")
val questions_df = Seq(Tuple1(questions)).toDF("questions")
val image_and_questions = bin_df.join(questions_df)
val results = pipeline.transform(image_and_questions).cache()
results.select("answers").show(false)
```
## Example
### Input:
```bash
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|questions |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ When it finish the Coffee Break?, Who is giving the Introductory Remarks?, Who is going to take part of the individual interviews?]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

### Output:
```bash
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|answers |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ When it finish the Coffee Break? -> 11:44 to 11:39 a.m., Who is giving the Introductory Remarks? -> lee a. waller, trrf vice presi- dent, Who is going to take part of the individual interviews? -> trrf treasurer]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
## Model Information
{:.table-model}
|---|---|
|Model Name:|docvqa_donut_base|
|Type:|ocr|
|Compatibility:|Visual NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
---
layout: model
title: ELECTRA Sentence Embeddings(ELECTRA Base)
author: John Snow Labs
name: sent_electra_base_uncased
date: 2020-08-27
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
ELECTRA is a BERT-like model that is pre-trained as a discriminator in a set-up resembling a generative adversarial network (GAN). It was originally published by:
Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning: [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/forum?id=r1xMH1BtvB), ICLR 2020.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_electra_base_uncased_en_2.6.0_2.4_1598489784655.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_electra_base_uncased_en_2.6.0_2.4_1598489784655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_electra_base_uncased", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_electra_base_uncased", "en")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.electra_base_uncased').predict(text, output_level='sentence')
embeddings_df
```
{:.h2_title}
## Results
```bash
sentence en_embed_sentence_electra_base_uncased_embeddings
I hate cancer [0.18555310368537903, -0.1990899294614792, 0.2...
Antibiotics aren't painkiller [-0.23764970898628235, -0.21351191401481628, -...
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_electra_base_uncased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|[en]|
|Dimension:|768|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from https://tfhub.dev/google/electra_base/2
---
layout: model
title: English image_classifier_vit_base_xray_pneumonia ViTForImageClassification from nickmuchi
author: John Snow Labs
name: image_classifier_vit_base_xray_pneumonia
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_xray_pneumonia` is an English model originally trained by nickmuchi.
## Predicted Entities
`NORMAL`, `PNEUMONIA`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_xray_pneumonia_en_4.1.0_3.0_1660170972982.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_xray_pneumonia_en_4.1.0_3.0_1660170972982.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_base_xray_pneumonia", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_base_xray_pneumonia", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_base_xray_pneumonia|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: English DistilBertForQuestionAnswering model (from Plimpton)
author: John Snow Labs
name: distilbert_qa_Plimpton_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Plimpton`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Plimpton_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724353175.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Plimpton_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724353175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Plimpton_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Plimpton_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Plimpton").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_Plimpton_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Plimpton/distilbert-base-uncased-finetuned-squad
---
layout: model
title: XLM-RoBERTa 40-Language NER Pipeline
author: John Snow Labs
name: xlm_roberta_token_classifier_ner_40_lang_pipeline
date: 2022-06-27
tags: [open_source, ner, token_classifier, xlm_roberta, multilang, "40", xx]
task: Named Entity Recognition
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [xlm_roberta_token_classifier_ner_40_lang](https://nlp.johnsnowlabs.com/2021/09/28/xlm_roberta_token_classifier_ner_40_lang_xx.html) model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_token_classifier_ner_40_lang_pipeline_xx_4.0.0_3.0_1656370754079.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_token_classifier_ner_40_lang_pipeline_xx_4.0.0_3.0_1656370754079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("xlm_roberta_token_classifier_ner_40_lang_pipeline", lang = "xx")
pipeline.annotate(["My name is John and I work at John Snow Labs.", "انا اسمي احمد واعمل في ارامكو"])
```
```scala
val pipeline = new PretrainedPipeline("xlm_roberta_token_classifier_ner_40_lang_pipeline", lang = "xx")
pipeline.annotate(Array("My name is John and I work at John Snow Labs.", "انا اسمي احمد واعمل في ارامكو"))
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|John |PER |
|John Snow Labs|ORG |
|احمد |PER |
|ارامكو |ORG |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_token_classifier_ner_40_lang_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|xx|
|Size:|967.7 MB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- XlmRoBertaForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: Pipeline to Detect Clinical Entities (ner_jsl_enriched)
author: John Snow Labs
name: ner_jsl_enriched_pipeline
date: 2023-03-14
tags: [ner, licensed, clinical, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_jsl_enriched](https://nlp.johnsnowlabs.com/2021/10/22/ner_jsl_enriched_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_pipeline_en_4.3.0_3.2_1678779376891.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_pipeline_en_4.3.0_3.2_1678779376891.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_jsl_enriched_pipeline", "en", "clinical/models")
text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_jsl_enriched_pipeline", "en", "clinical/models")
val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.jsl_enriched.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_swedish_cased_squad_experimental","sv") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_swedish_cased_squad_experimental","sv")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("sv.answer_question.squad.bert.base_cased.by_KB").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_swedish_cased_squad_experimental|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|sv|
|Size:|465.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/KB/bert-base-swedish-cased-squad-experimental
---
layout: model
title: Language Detection & Identification Pipeline - 220 Languages
author: John Snow Labs
name: detect_language_220
date: 2020-12-05
task: [Pipeline Public, Language Detection, Sentence Detection]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [language_detection, open_source, pipeline, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language from documents with mixed languages by coalescing sentences and selecting the best candidate.
We have designed and developed Deep Learning models using CNN architectures in TensorFlow/Keras. The models are trained on large datasets such as Wikipedia and Tatoeba with high accuracy evaluated on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias).
This pipeline can detect the following languages:
## Predicted Entities
`Achinese`, `Afrikaans`, `Tosk Albanian`, `Amharic`, `Aragonese`, `Old English`, `Arabic`, `Egyptian Arabic`, `Assamese`, `Asturian`, `Avaric`, `Aymara`, `Azerbaijani`, `South Azerbaijani`, `Bashkir`, `Bavarian`, `bat-smg`, `Central Bikol`, `Belarusian`, `Bulgarian`, `bh`, `Bengali`, `Tibetan`, `Bishnupriya`, `Breton`, `Russia Buriat`, `Catalan`, `Min Dong Chinese`, `Chechen`, `Cebuano`, `Central Kurdish (Soranî)`, `Corsican`, `Crimean Tatar`, `Czech`, `Kashubian`, `Chuvash`, `Welsh`, `Danish`, `German`, `Dimli (individual language)`, `Lower Sorbian`, `Dhivehi`, `Greek`, `eml`, `English`, `Esperanto`, `Spanish`, `Estonian`, `Basque`, `Extremaduran`, `Persian`, `Finnish`, `fiu-vro`, `Faroese`, `French`, `Arpitan`, `Friulian`, `Frisian`, `Irish`, `Gagauz`, `Scottish Gaelic`, `Galician`, `Guarani`, `Konkani (Goan)`, `Gujarati`, `Manx`, `Hausa`, `Hakka Chinese`, `Hebrew`, `Hindi`, `Fiji Hindi`, `Upper Sorbian`, `Haitian Creole`, `Hungarian`, `Armenian`, `Interlingua`, `Indonesian`, `Interlingue`, `Igbo`, `Ilocano`, `Ido`, `Icelandic`, `Italian`, `Japanese`, `Jamaican Patois`, `Lojban`, `Javanese`, `Georgian`, `Karakalpak`, `Kabyle`, `Kabardian`, `Kazakh`, `Khmer`, `Kannada`, `Korean`, `Komi-Permyak`, `Karachay-Balkar`, `Kölsch`, `Kurdish`, `Komi`, `Cornish`, `Kyrgyz`, `Latin`, `Ladino`, `Luxembourgish`, `Lezghian`, `Luganda`, `Limburgan`, `Ligurian`, `Lombard`, `Lingala`, `Lao`, `Northern Luri`, `Lithuanian`, `Latvian`, `Maithili`, `map-bms`, `Malagasy`, `Meadow Mari`, `Maori`, `Minangkabau`, `Macedonian`, `Malayalam`, `Mongolian`, `Marathi`, `Hill Mari`, `Maltese`, `Mirandese`, `Burmese`, `Erzya`, `Mazanderani`, `Nahuatl`, `Neapolitan`, `Low German (Low Saxon)`, `nds-nl`, `Nepali`, `Newari`, `Dutch`, `Norwegian Nynorsk`, `Norwegian`, `Narom`, `Pedi`, `Navajo`, `Occitan`, `Livvi`, `Oromo`, `Odia (Oriya)`, `Ossetian`, `Punjabi (Eastern)`, `Pangasinan`, `Kapampangan`, `Papiamento`, `Picard`, `Palatine German`, `Polish`, `Punjabi (Western)`, `Pashto`, `Portuguese`, 
`Quechua`, `Romansh`, `Romanian`, `roa-tara`, `Russian`, `Rusyn`, `Kinyarwanda`, `Sanskrit`, `Yakut`, `Sardinian`, `Sicilian`, `Scots`, `Sindhi`, `Northern Sami`, `Sinhala`, `Slovak`, `Slovenian`, `Shona`, `Somali`, `Albanian`, `Serbian`, `Saterland Frisian`, `Sundanese`, `Swedish`, `Swahili`, `Silesian`, `Tamil`, `Tulu`, `Telugu`, `Tetun`, `Tajik`, `Thai`, `Turkmen`, `Tagalog`, `Setswana`, `Tongan`, `Turkish`, `Tatar`, `Tuvinian`, `Udmurt`, `Uyghur`, `Ukrainian`, `Urdu`, `Uzbek`, `Venetian`, `Veps`, `Vietnamese`, `Vlaams`, `Volapük`, `Walloon`, `Waray`, `Wolof`, `Shanghainese`, `Xhosa`, `Mingrelian`, `Yiddish`, `Yoruba`, `Zeeuws`, `Chinese`, `zh-classical`, `zh-min-nan`, `zh-yue`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/LANGUAGE_DETECTOR/){:.button.button-orange.button-orange-trans.co.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_220_xx_2.7.0_2.4_1607185721383.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/detect_language_220_xx_2.7.0_2.4_1607185721383.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("detect_language_220", lang = "xx")
pipeline.annotate("French author who helped pioneer the science-fiction genre.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("detect_language_220", lang = "xx")
pipeline.annotate("French author who helped pioneer the science-fiction genre.")
```
{:.nlu-block}
```python
import nlu
text = ["French author who helped pioneer the science-fiction genre."]
lang_df = nlu.load("xx.classify.lang.220").predict(text)
lang_df
```
## Results
```bash
{'document': ['French author who helped pioneer the science-fiction genre.'],
'language': ['en']}
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|detect_language_220|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- LanguageDetectorDL
---
layout: model
title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011)
author: John Snow Labs
name: distilbert_token_classifier_autotrain_name_all_904029577
date: 2023-03-14
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-name_all-904029577` is an English model originally trained by `ismail-lucifer011`.
## Predicted Entities
`Name`, `OOV`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_all_904029577_en_4.3.1_3.0_1678783486467.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_all_904029577_en_4.3.1_3.0_1678783486467.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_all_904029577","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_all_904029577","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_autotrain_name_all_904029577|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ismail-lucifer011/autotrain-name_all-904029577
---
layout: model
title: English DistilBertForQuestionAnswering model (from aszidon) Custom
author: John Snow Labs
name: distilbert_qa_custom
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbertcustom` is an English model originally trained by `aszidon`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom_en_4.0.0_3.0_1654727944431.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom_en_4.0.0_3.0_1654727944431.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.custom.by_aszidon").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_custom|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/aszidon/distilbertcustom
---
layout: model
title: Legal Time Clause Binary Classifier
author: John Snow Labs
name: legclf_time_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `time` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have big legal documents, and you want to look for clauses, we recommend you to split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `time`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_time_clause_en_1.0.0_3.2_1660124089546.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_time_clause_en_1.0.0_3.2_1660124089546.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
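This card omits a usage snippet. Following the pattern of the other classifier cards, a minimal sketch is shown below. The pipeline stages are assumptions based on this model's input/output labels (`sentence_embeddings` → `category`); the sentence-embeddings model name used here is illustrative, so check the Models Hub entry for the exact embeddings this classifier was trained with.

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Illustrative embeddings stage: the classifier expects a "sentence_embeddings" column.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_time_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])
data = spark.createDataFrame([["PUT YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("category.result").show(truncate=False)
```

As with the other clause classifiers, feed whole paragraphs or split sections rather than single sentences, so the model sees enough context.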
## Results
```bash
+-------+
| result|
+-------+
| [time]|
|[other]|
|[other]|
| [time]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_time_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.2 MB|
## References
Legal documents, scraped from the Internet, and classified in-house.
## Benchmarking
```bash
label precision recall f1-score support
other 0.98 0.99 0.99 304
time 0.98 0.96 0.97 150
accuracy - - 0.98 454
macro-avg 0.98 0.98 0.98 454
weighted-avg 0.98 0.98 0.98 454
```
---
layout: model
title: English DistilBertForQuestionAnswering Base Cased model (from SauravMaheshkar)
author: John Snow Labs
name: distilbert_qa_base_cased_led_chaii
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-chaii` is an English model originally trained by `SauravMaheshkar`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_chaii_en_4.3.0_3.0_1672766429348.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_chaii_en_4.3.0_3.0_1672766429348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_chaii","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_chaii","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_cased_led_chaii|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/SauravMaheshkar/distilbert-base-cased-distilled-chaii
---
layout: model
title: Sentence Entity Resolver for Snomed Concepts, Body Structure Version
author: John Snow Labs
name: sbertresolve_snomed_bodyStructure_med
date: 2021-07-08
tags: [snomed, en, entity_resolution, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.1.0
spark_version: 2.4
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical (anatomical structures) entities to Snomed codes (body structure version) using sentence embeddings.
## Predicted Entities
Snomed Codes and their normalized definition with `sbert_jsl_medium_uncased` embeddings.
{:.btn-box}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_snomed_bodyStructure_med_en_3.1.0_2.4_1625772026635.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_snomed_bodyStructure_med_en_3.1.0_2.4_1625772026635.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
jsl_sbert_embedder = BertSentenceEmbeddings\
.pretrained('sbert_jsl_medium_uncased','en','clinical/models')\
.setInputCols(["ner_chunk"])\
.setOutputCol("sbert_embeddings")
snomed_resolver = SentenceEntityResolverModel\
.pretrained("sbertresolve_snomed_bodyStructure_med", "en", "clinical/models") \
.setInputCols(["ner_chunk", "sbert_embeddings"]) \
.setOutputCol("snomed_code")
snomed_pipelineModel = PipelineModel(
stages = [
documentAssembler,
jsl_sbert_embedder,
snomed_resolver])
snomed_lp = LightPipeline(snomed_pipelineModel)
result = snomed_lp.fullAnnotate("Amputation stump")
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbert_jsl_medium_uncased","en","clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("sbert_embeddings")
val snomed_resolver = SentenceEntityResolverModel
.pretrained("sbertresolve_snomed_bodyStructure_med", "en", "clinical/models")
.setInputCols(Array("ner_chunk", "sbert_embeddings"))
.setOutputCol("snomed_code")
val snomed_pipeline = new Pipeline().setStages(Array(document_assembler, sbert_embedder, snomed_resolver))
val snomed_pipelineModel = snomed_pipeline.fit(Seq("").toDF("text"))
val snomed_lp = new LightPipeline(snomed_pipelineModel)
val result = snomed_lp.fullAnnotate("Amputation stump")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.snomed_body_structure_med").predict("""Amputation stump""")
```
## Results
```bash
| | chunks | code | resolutions | all_codes | all_distances |
|---:|:-----------------|:---------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------|
| 0 | amputation stump | 38033009 | [Amputation stump, Amputation stump of upper limb, Amputation stump of left upper limb, Amputation stump of lower limb, Amputation stump of left lower limb, Amputation stump of right upper limb, Amputation stump of right lower limb, ...]| ['38033009', '771359009', '771364008', '771358001', '771367001', '771365009', '771368006', ...] | ['0.0000', '0.0773', '0.0858', '0.0863', '0.0905', '0.0911', '0.0972', ...] |
```
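The `all_distances` column reports embedding distances between the input chunk and each candidate term (smaller is closer; an exact string match resolves at distance 0.0000). As a generic illustration of cosine distance (the resolver's internal metric and embeddings are not exposed here, so the vectors below are made up):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Identical vectors are at distance ~0; orthogonal vectors at distance 1.
print(cosine_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # ≈ 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))            # 1.0
```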
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbertresolve_snomed_bodyStructure_med|
|Compatibility:|Healthcare NLP 3.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[snomed_code]|
|Language:|en|
|Case sensitive:|true|
## Data Source
https://www.snomed.org/
---
layout: model
title: German asr_exp_w2v2t_vp_s962 TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: pipeline_asr_exp_w2v2t_vp_s962
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2t_vp_s962` is a German model originally trained by jonatasgrosman.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2t_vp_s962_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_vp_s962_de_4.2.0_3.0_1664111856089.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_vp_s962_de_4.2.0_3.0_1664111856089.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2t_vp_s962', lang = 'de')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2t_vp_s962", lang = "de")
val annotations = pipeline.transform(audioDF)
```
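The snippets above assume an `audioDF` whose audio column holds arrays of floats. One way to obtain such arrays from mono 16-bit PCM WAV data, sketched with only the Python standard library (the resulting lists would then populate the DataFrame column the pipeline reads; a synthetic WAV is built in memory purely for demonstration):

```python
import io
import struct
import wave

def wav_bytes_to_floats(wav_bytes):
    """Decode mono 16-bit PCM WAV bytes into floats in [-1, 1]."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny 4-sample WAV in memory to demonstrate the round trip.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

floats = wav_bytes_to_floats(buf.getvalue())
print(floats)  # [0.0, 0.5, -0.5, 0.999969...]
```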
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_exp_w2v2t_vp_s962|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|de|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Detect Anatomical References (biobert)
author: John Snow Labs
name: ner_anatomy_biobert
date: 2021-04-01
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Detect anatomical sites and references in medical text using a pretrained NER model.
## Predicted Entities
`tissue_structure`, `Organism_substance`, `Developing_anatomical_structure`, `Cell`, `Cellular_component`, `Immaterial_anatomical_entity`, `Organ`, `Pathological_formation`, `Organism_subdivision`, `Anatomical_system`, `Tissue`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ANATOMY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_biobert_en_3.0.0_3.0_1617260624773.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_biobert_en_3.0.0_3.0_1617260624773.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_anatomy_biobert", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_anatomy_biobert", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))
val data = Seq("EXAMPLE_TEXT").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.anatomy.biobert").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_anatomy_biobert|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Benchmarking
```bash
+-------------------------------+-----+----+----+-----+---------+------+------+
| entity| tp| fp| fn|total|precision|recall| f1|
+-------------------------------+-----+----+----+-----+---------+------+------+
| Organ| 53.0|17.0|12.0| 65.0| 0.7571|0.8154|0.7852|
| Pathological_formation| 83.0|23.0|14.0| 97.0| 0.783|0.8557|0.8177|
| Organism_substance| 42.0| 1.0|14.0| 56.0| 0.9767| 0.75|0.8485|
| tissue_structure|131.0|28.0|49.0|180.0| 0.8239|0.7278|0.7729|
| Cellular_component| 17.0| 0.0|20.0| 37.0| 1.0|0.4595|0.6296|
| Tissue| 27.0| 4.0|16.0| 43.0| 0.871|0.6279|0.7297|
| Anatomical_system| 15.0| 3.0| 8.0| 23.0| 0.8333|0.6522|0.7317|
|Developing_anatomical_structure| 2.0| 1.0| 3.0| 5.0| 0.6667| 0.4| 0.5|
| Immaterial_anatomical_entity| 7.0| 2.0| 6.0| 13.0| 0.7778|0.5385|0.6364|
| Cell|180.0| 6.0|15.0|195.0| 0.9677|0.9231|0.9449|
| Organism_subdivision| 11.0| 5.0|10.0| 21.0| 0.6875|0.5238|0.5946|
+-------------------------------+-----+----+----+-----+---------+------+------+
+------------------+
| macro|
+------------------+
|0.7264701979913192|
+------------------+
+------------------+
| micro|
+------------------+
|0.8108878300337679|
+------------------+
```
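The per-entity columns above follow the usual definitions (precision = tp/(tp+fp), recall = tp/(tp+fn), F1 = their harmonic mean). Reproducing the `Organ` row as a sanity check:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 4), round(recall, 4), round(f1, 4)

# Organ row: tp=53, fp=17, fn=12
print(prf(53, 17, 12))  # (0.7571, 0.8154, 0.7852)
```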
---
layout: model
title: Part of Speech for Hindi
author: John Snow Labs
name: pos_ud_hdtb
date: 2020-07-29 23:34:00 +0800
task: Part of Speech Tagging
language: hi
edition: Spark NLP 2.5.5
spark_version: 2.4
tags: [pos, hi]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_hdtb_hi_2.5.5_2.4_1596054066666.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_hdtb_hi_2.5.5_2.4_1596054066666.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
pos = PerceptronModel.pretrained("pos_ud_hdtb", "hi") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("उत्तर के राजा होने के अलावा, जॉन स्नो एक अंग्रेजी चिकित्सक और संज्ञाहरण और चिकित्सा स्वच्छता के विकास में अग्रणी है।")
```
```scala
...
val pos = PerceptronModel.pretrained("pos_ud_hdtb", "hi")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("उत्तर के राजा होने के अलावा, जॉन स्नो एक अंग्रेजी चिकित्सक और संज्ञाहरण और चिकित्सा स्वच्छता के विकास में अग्रणी है।").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""उत्तर के राजा होने के अलावा, जॉन स्नो एक अंग्रेजी चिकित्सक और संज्ञाहरण और चिकित्सा स्वच्छता के विकास में अग्रणी है।"""]
pos_df = nlu.load('hi.pos').predict(text, output_level='token')
pos_df
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='pos', begin=0, end=4, result='PROPN', metadata={'word': 'उत्तर'}),
Row(annotatorType='pos', begin=6, end=7, result='ADP', metadata={'word': 'के'}),
Row(annotatorType='pos', begin=9, end=12, result='NOUN', metadata={'word': 'राजा'}),
Row(annotatorType='pos', begin=14, end=17, result='VERB', metadata={'word': 'होने'}),
Row(annotatorType='pos', begin=19, end=20, result='ADP', metadata={'word': 'के'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_hdtb|
|Type:|pos|
|Compatibility:|Spark NLP 2.5.5+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[pos]|
|Language:|hi|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: Turkish BERT Base Uncased (BERTurk)
author: John Snow Labs
name: bert_base_turkish_uncased
date: 2021-05-20
tags: [open_source, embeddings, bert, turkish, tr]
task: Embeddings
language: tr
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
BERTurk is a community-driven uncased BERT model for Turkish. Some of the datasets used for pretraining and evaluation were contributed by the Turkish NLP community, which also chose the model name: BERTurk.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_turkish_uncased_tr_3.1.0_2.4_1621510523359.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_turkish_uncased_tr_3.1.0_2.4_1621510523359.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
embeddings = BertEmbeddings.pretrained("bert_base_turkish_uncased", "tr") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
```
```scala
val embeddings = BertEmbeddings.pretrained("bert_base_turkish_uncased", "tr")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
```
{:.nlu-block}
```python
import nlu
nlu.load("tr.embed.bert.uncased").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_base_turkish_uncased|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[embeddings]|
|Language:|tr|
|Case sensitive:|true|
## Data Source
[https://huggingface.co/dbmdz/bert-base-turkish-uncased](https://huggingface.co/dbmdz/bert-base-turkish-uncased)
## Benchmarking
For results on PoS tagging and NER tasks, please refer to [this repository](https://github.com/stefan-it/turkish-bert).
---
layout: model
title: RE Pipeline between Body Parts and Procedures
author: John Snow Labs
name: re_bodypart_proceduretest_pipeline
date: 2023-06-13
tags: [licensed, clinical, relation_extraction, body_part, procedures, en]
task: Relation Extraction
language: en
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [re_bodypart_proceduretest](https://nlp.johnsnowlabs.com/2021/01/18/re_bodypart_proceduretest_en.html) model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_bodypart_proceduretest_pipeline_en_4.4.4_3.2_1686664541054.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_bodypart_proceduretest_pipeline_en_4.4.4_3.2_1686664541054.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("re_bodypart_proceduretest_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("re_bodypart_proceduretest_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.bodypart_proceduretest.pipeline").predict("""TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.""")
```
## Results
```bash
| index | relations | entity1                      | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2              | confidence |
|-------|-----------|------------------------------|---------------|-------------|--------|---------|---------------|-------------|---------------------|------------|
| 0     | 1         | External_body_part_or_region | 94            | 98          | chest  | Test    | 117           | 135         | portable ultrasound | 1.0        |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|re_bodypart_proceduretest_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- PerceptronModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- DependencyParserModel
- RelationExtractionModel
---
layout: model
title: Sentence Detection in Malayalam Text
author: John Snow Labs
name: sentence_detector_dl
date: 2021-08-30
tags: [ml, sentence_detection, open_source]
task: Sentence Detection
language: ml
edition: Spark NLP 3.2.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_ml_3.2.0_3.0_1630336657068.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_ml_3.2.0_3.0_1630336657068.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel\
.pretrained("sentence_detector_dl", "ml") \
.setInputCols(["document"]) \
.setOutputCol("sentences")
sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL]))
sd_model.fullAnnotate("""ഇംഗ്ലീഷ് വായിക്കുന്ന ഖണ്ഡികകളുടെ മികച്ച ഉറവിടം തേടുകയാണോ? നിങ്ങൾ ശരിയായ സ്ഥലത്ത് എത്തിയിരിക്കുന്നു. അടുത്തിടെ നടത്തിയ ഒരു പഠനമനുസരിച്ച്, ഇന്നത്തെ യുവാക്കളിൽ വായനാശീലം അതിവേഗം കുറഞ്ഞുവരികയാണ്. ഒരു നിശ്ചിത സെക്കൻഡിൽ കൂടുതൽ ഒരു ഇംഗ്ലീഷ് വായന ഖണ്ഡികയിൽ ശ്രദ്ധ കേന്ദ്രീകരിക്കാൻ അവർക്ക് കഴിയില്ല! കൂടാതെ, വായന എല്ലാ മത്സര പരീക്ഷകളുടെയും അവിഭാജ്യ ഘടകമായിരുന്നു. അതിനാൽ, നിങ്ങളുടെ വായനാ കഴിവുകൾ എങ്ങനെ മെച്ചപ്പെടുത്താം? ഈ ചോദ്യത്തിനുള്ള ഉത്തരം യഥാർത്ഥത്തിൽ മറ്റൊരു ചോദ്യമാണ്: വായനാ വൈദഗ്ധ്യത്തിന്റെ ഉപയോഗം എന്താണ്? വായനയുടെ പ്രധാന ലക്ഷ്യം 'അർത്ഥവത്താക്കുക' എന്നതാണ്.""")
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "ml")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val pipeline = new Pipeline().setStages(Array(documenter, model))
val data = Seq("ഇംഗ്ലീഷ് വായിക്കുന്ന ഖണ്ഡികകളുടെ മികച്ച ഉറവിടം തേടുകയാണോ? നിങ്ങൾ ശരിയായ സ്ഥലത്ത് എത്തിയിരിക്കുന്നു. അടുത്തിടെ നടത്തിയ ഒരു പഠനമനുസരിച്ച്, ഇന്നത്തെ യുവാക്കളിൽ വായനാശീലം അതിവേഗം കുറഞ്ഞുവരികയാണ്. ഒരു നിശ്ചിത സെക്കൻഡിൽ കൂടുതൽ ഒരു ഇംഗ്ലീഷ് വായന ഖണ്ഡികയിൽ ശ്രദ്ധ കേന്ദ്രീകരിക്കാൻ അവർക്ക് കഴിയില്ല! കൂടാതെ, വായന എല്ലാ മത്സര പരീക്ഷകളുടെയും അവിഭാജ്യ ഘടകമായിരുന്നു. അതിനാൽ, നിങ്ങളുടെ വായനാ കഴിവുകൾ എങ്ങനെ മെച്ചപ്പെടുത്താം? ഈ ചോദ്യത്തിനുള്ള ഉത്തരം യഥാർത്ഥത്തിൽ മറ്റൊരു ചോദ്യമാണ്: വായനാ വൈദഗ്ധ്യത്തിന്റെ ഉപയോഗം എന്താണ്? വായനയുടെ പ്രധാന ലക്ഷ്യം 'അർത്ഥവത്താക്കുക' എന്നതാണ്.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load('ml.sentence_detector').predict("ഇംഗ്ലീഷ് വായിക്കുന്ന ഖണ്ഡികകളുടെ മികച്ച ഉറവിടം തേടുകയാണോ? നിങ്ങൾ ശരിയായ സ്ഥലത്ത് എത്തിയിരിക്കുന്നു. അടുത്തിടെ നടത്തിയ ഒരു പഠനമനുസരിച്ച്, ഇന്നത്തെ യുവാക്കളിൽ വായനാശീലം അതിവേഗം കുറഞ്ഞുവരികയാണ്. ഒരു നിശ്ചിത സെക്കൻഡിൽ കൂടുതൽ ഒരു ഇംഗ്ലീഷ് വായന ഖണ്ഡികയിൽ ശ്രദ്ധ കേന്ദ്രീകരിക്കാൻ അവർക്ക് കഴിയില്ല! കൂടാതെ, വായന എല്ലാ മത്സര പരീക്ഷകളുടെയും അവിഭാജ്യ ഘടകമായിരുന്നു. അതിനാൽ, നിങ്ങളുടെ വായനാ കഴിവുകൾ എങ്ങനെ മെച്ചപ്പെടുത്താം? ഈ ചോദ്യത്തിനുള്ള ഉത്തരം യഥാർത്ഥത്തിൽ മറ്റൊരു ചോദ്യമാണ്: വായനാ വൈദഗ്ധ്യത്തിന്റെ ഉപയോഗം എന്താണ്? വായനയുടെ പ്രധാന ലക്ഷ്യം 'അർത്ഥവത്താക്കുക' എന്നതാണ്.", output_level ='sentence')
```
## Results
```bash
+----------------------------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------------------------+
|[ഇംഗ്ലീഷ് വായിക്കുന്ന ഖണ്ഡികകളുടെ മികച്ച ഉറവിടം തേടുകയാണോ?] |
|[നിങ്ങൾ ശരിയായ സ്ഥലത്ത് എത്തിയിരിക്കുന്നു.] |
|[അടുത്തിടെ നടത്തിയ ഒരു പഠനമനുസരിച്ച്, ഇന്നത്തെ യുവാക്കളിൽ വായനാശീലം അതിവേഗം കുറഞ്ഞുവരികയാണ്.] |
|[ഒരു നിശ്ചിത സെക്കൻഡിൽ കൂടുതൽ ഒരു ഇംഗ്ലീഷ് വായന ഖണ്ഡികയിൽ ശ്രദ്ധ കേന്ദ്രീകരിക്കാൻ അവർക്ക് കഴിയില്ല!]|
|[കൂടാതെ, വായന എല്ലാ മത്സര പരീക്ഷകളുടെയും അവിഭാജ്യ ഘടകമായിരുന്നു.] |
|[അതിനാൽ, നിങ്ങളുടെ വായനാ കഴിവുകൾ എങ്ങനെ മെച്ചപ്പെടുത്താം?] |
|[ഈ ചോദ്യത്തിനുള്ള ഉത്തരം യഥാർത്ഥത്തിൽ മറ്റൊരു ചോദ്യമാണ്:] |
|[വായനാ വൈദഗ്ധ്യത്തിന്റെ ഉപയോഗം എന്താണ്?] |
|[വായനയുടെ പ്രധാന ലക്ഷ്യം 'അർത്ഥവത്താക്കുക' എന്നതാണ്.] |
+----------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sentence_detector_dl|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[sentences]|
|Language:|ml|
## Benchmarking
```bash
label Accuracy Recall Prec F1
0 0.98 1.00 0.96 0.98
```
---
layout: model
title: English BertForQuestionAnswering model (from kaporter)
author: John Snow Labs
name: bert_qa_kaporter_bert_base_uncased_finetuned_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-squad` is an English model originally trained by `kaporter`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_kaporter_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181111131.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_kaporter_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181111131.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kaporter_bert_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_kaporter_bert_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base_uncased.by_kaporter").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
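The nlu one-liner above passes the question and context in a single string joined by `|||`. A tiny hypothetical helper, just to make the convention explicit:

```python
def to_nlu_qa_string(question, context):
    """Join a question and its context with the '|||' separator nlu expects."""
    return f"{question}|||{context}"

print(to_nlu_qa_string("What's my name?",
                       "My name is Clara and I live in Berkeley."))
# What's my name?|||My name is Clara and I live in Berkeley.
```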
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_kaporter_bert_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/kaporter/bert-base-uncased-finetuned-squad
---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_0
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-128-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_0_en_4.0.0_3.0_1655731053643.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_0_en_4.0.0_3.0_1655731053643.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_0","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_128d_seed_0").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|422.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-128-finetuned-squad-seed-0
---
layout: model
title: TREC(6) Question Classifier
author: John Snow Labs
name: classifierdl_use_trec6
date: 2021-01-08
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 2.7.1
spark_version: 2.4
tags: [classifier, open_source, en, text_classification]
supported: true
annotator: ClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Classify open-domain, fact-based questions into one of the following broad semantic categories: Abbreviation, Description, Entities, Human Beings, Locations, or Numeric Values.
## Predicted Entities
``ABBR``, ``DESC``, ``NUM``, ``ENTY``, ``LOC``, ``HUM``.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_EN_TREC/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_EN_TREC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec6_en_2.7.1_2.4_1610118062425.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec6_en_2.7.1_2.4_1610118062425.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
use = UniversalSentenceEncoder.pretrained(lang="en") \
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
document_classifier = ClassifierDLModel.pretrained('classifierdl_use_trec6', 'en') \
.setInputCols(["document", "sentence_embeddings"]) \
.setOutputCol("class")
nlpPipeline = Pipeline(stages=[documentAssembler, use, document_classifier])
light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate('When did the construction of stone circles begin in the UK?')
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val use = UniversalSentenceEncoder.pretrained(lang="en")
.setInputCols(Array("document"))
.setOutputCol("sentence_embeddings")
val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_trec6", "en")
.setInputCols(Array("document", "sentence_embeddings"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier))
val data = Seq("When did the construction of stone circles begin in the UK?").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""When did the construction of stone circles begin in the UK?"""]
trec6_df = nlu.load('en.classify.trec6.use').predict(text, output_level='document')
trec6_df[["document", "trec6"]]
```
## Results
```bash
+------------------------------------------------------------------------------------------------+------------+
|document |class |
+------------------------------------------------------------------------------------------------+------------+
|When did the construction of stone circles begin in the UK? | NUM |
+------------------------------------------------------------------------------------------------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|classifierdl_use_trec6|
|Compatibility:|Spark NLP 2.7.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
## Benchmarking
```bash
precision recall f1-score support
ABBR 0.00 0.00 0.00 26
DESC 0.89 0.96 0.92 343
ENTY 0.86 0.86 0.86 391
HUM 0.91 0.90 0.91 366
LOC 0.88 0.91 0.89 233
NUM 0.94 0.94 0.94 274
accuracy 0.89 1633
macro avg 0.75 0.76 0.75 1633
weighted avg 0.88 0.89 0.89 1633
```
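The macro and weighted averages in the benchmark follow directly from the per-class scores. A quick sanity check, with the precision, recall, and support values copied from the table above:

```python
# Per-class (precision, recall, support) copied from the benchmark table.
classes = {
    "ABBR": (0.00, 0.00, 26),
    "DESC": (0.89, 0.96, 343),
    "ENTY": (0.86, 0.86, 391),
    "HUM":  (0.91, 0.90, 366),
    "LOC":  (0.88, 0.91, 233),
    "NUM":  (0.94, 0.94, 274),
}
total = sum(s for _, _, s in classes.values())  # 1633, matches the support column

# Macro average: unweighted mean over classes (rare ABBR drags it down).
macro_p = sum(p for p, _, _ in classes.values()) / len(classes)
macro_r = sum(r for _, r, _ in classes.values()) / len(classes)

# Weighted average: mean weighted by per-class support.
weighted_p = sum(p * s for p, _, s in classes.values()) / total

print(round(macro_p, 2), round(macro_r, 2), round(weighted_p, 2))  # 0.75 0.76 0.88
```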
## Data Source
This model is trained on the 6-class (coarse label) version of the TREC dataset. http://search.r-project.org/library/textdata/html/dataset_trec.html
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from AlirezaBaneshi)
author: John Snow Labs
name: roberta_qa_autotrain_test2_756523214
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-test2-756523214` is an English model originally trained by `AlirezaBaneshi`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_autotrain_test2_756523214_en_4.3.0_3.0_1674209197246.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_autotrain_test2_756523214_en_4.3.0_3.0_1674209197246.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_autotrain_test2_756523214","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_autotrain_test2_756523214","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_autotrain_test2_756523214|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|415.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AlirezaBaneshi/autotrain-test2-756523214
---
layout: model
title: Legal NER in Greek Legislations
author: John Snow Labs
name: legner_greek_legislation
date: 2023-04-25
tags: [el, legal, ner, licensed, legislation]
task: Named Entity Recognition
language: el
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Legal NER model extracts the following entities from the Greek legislations:
- `FACILITY`: Facilities, such as police stations, departments, etc.
- `GPE`: Geopolitical Entity; any reference to a geopolitical entity (e.g., country, city, Greek administrative unit, etc.)
- `LEG_REF`: Legislation Reference; any reference to Greek or European legislation
- `ORG`: Organization; any reference to a public or private organization
- `PER`: Any formal name of a person mentioned in the text
- `PUBLIC_DOC`: Public Document Reference
## Predicted Entities
`FACILITY`, `GPE`, `LEG_REF`, `PUBLIC_DOC`, `PER`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_greek_legislation_el_1.0.0_3.0_1682420832367.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_greek_legislation_el_1.0.0_3.0_1682420832367.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
import pandas as pd

document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_el_cased","el")\
.setInputCols(["document", "token"])\
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)\
.setCaseSensitive(True)
ner_model = legal.NerModel.pretrained("legner_greek_legislation", "el", "legal/models")\
.setInputCols(["document", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text_list = ["""3 του άρθρου 5 του ν. 3148/2003, όπως ισχύει, αντικαθίσταται ως εξής""",
"""1 του άρθρου 1 ασκούνται πλέον από την ΕΥΔΕ/ΕΣΕΑ μέσα σε δύο μήνες από την έναρξη ισχύος του παρόντος Διατάγματος.""",
"""Ο Πρόεδρος της Επιτροπής και τα τέσσερα μέλη με ισάριθμα αναπληρωματικά εκλέγονται μεταξύ των δημοτών του Δήμου Κυθήρων.""",
"""Τη με αριθ. 117/Σ.10η/25 Ιουλ 2016 γνωμοδότηση του Ανωτάτου Στρατιωτικού Συμβουλίου."""]
result = model.transform(spark.createDataFrame(pd.DataFrame({"text" : text_list})))
```
## Results
```bash
+----------------------------------------+----------+
|chunk |ner_label |
+----------------------------------------+----------+
|ν. 3148/2003 |LEG_REF |
|ΕΥΔΕ/ΕΣΕΑ |ORG |
|Δήμου Κυθήρων |GPE |
|αριθ. 117/Σ.10η/25 Ιουλ 2016 γνωμοδότηση|PUBLIC_DOC|
+----------------------------------------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_greek_legislation|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|el|
|Size:|16.4 MB|
## References
In-house annotations
## Benchmarking
```bash
label precision recall f1-score support
FACILITY 0.94 0.80 0.86 64
GPE 0.77 0.83 0.80 136
LEG_REF 0.94 0.90 0.92 93
ORG 0.85 0.74 0.79 173
PER 0.72 0.71 0.71 58
PUBLIC_DOC 0.76 0.82 0.79 39
micro-avg 0.83 0.80 0.81 563
macro-avg 0.83 0.80 0.81 563
weighted-avg 0.84 0.80 0.82 563
```
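Each per-label f1-score in the table above is simply the harmonic mean of that label's precision and recall. For example, using the FACILITY and LEG_REF values from the table:

```python
def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

print(round(f1(0.94, 0.80), 2))  # FACILITY -> 0.86
print(round(f1(0.94, 0.90), 2))  # LEG_REF  -> 0.92
```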
---
layout: model
title: Relation Extraction between Biomarkers and Results (ReDL)
author: John Snow Labs
name: redl_oncology_biomarker_result_biobert_wip
date: 2023-01-15
tags: [licensed, clinical, oncology, en, relation_extraction, test, biomarker, tensorflow]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This relation extraction model links Biomarker and Oncogene extractions to their corresponding Biomarker_Result extractions.
## Predicted Entities
`is_finding_of`, `O`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_biomarker_result_biobert_wip_en_4.2.4_3.0_1673766618517.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_biomarker_result_biobert_wip_en_4.2.4_3.0_1673766618517.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use relation pairs to include only the combinations of entities that are relevant in your case.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos_tags")
dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \
.setInputCols(["sentence", "pos_tags", "token"]) \
.setOutputCol("dependencies")
re_ner_chunk_filter = RENerChunksFilter()\
.setInputCols(["ner_chunk", "dependencies"])\
.setOutputCol("re_ner_chunk")\
.setMaxSyntacticDistance(10)\
.setRelationPairs(["Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene"])
re_model = RelationExtractionDLModel.pretrained("redl_oncology_biomarker_result_biobert_wip", "en", "clinical/models")\
.setInputCols(["re_ner_chunk", "sentence"])\
.setOutputCol("relation_extraction")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
pos_tagger,
dependency_parser,
re_ner_chunk_filter,
re_model])
data = spark.createDataFrame([["Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos_tags")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentence", "pos_tags", "token"))
.setOutputCol("dependencies")
val re_ner_chunk_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunk", "dependencies"))
.setOutputCol("re_ner_chunk")
.setMaxSyntacticDistance(10)
.setRelationPairs(Array("Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene"))
val re_model = RelationExtractionDLModel.pretrained("redl_oncology_biomarker_result_biobert_wip", "en", "clinical/models")
.setInputCols(Array("re_ner_chunk", "sentence"))
.setOutputCol("relation_extraction")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
pos_tagger,
dependency_parser,
re_ner_chunk_filter,
re_model))
val data = Seq("Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.oncology_biomarker_result_biobert_wip").predict("""Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.""")
```
## Results
```bash
+-------------+----------------+-------------+-----------+--------+----------------+-------------+-----------+--------------------+----------+
| relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence|
+-------------+----------------+-------------+-----------+--------+----------------+-------------+-----------+--------------------+----------+
|is_finding_of|Biomarker_Result| 25| 32|negative| Biomarker| 38| 67|thyroid transcrip...|0.99808085|
|is_finding_of|Biomarker_Result| 25| 32|negative| Biomarker| 73| 78| napsin|0.99637383|
|is_finding_of|Biomarker_Result| 96| 103|positive| Biomarker| 109| 110| ER|0.99221414|
|is_finding_of|Biomarker_Result| 96| 103|positive| Biomarker| 116| 117| PR| 0.9893672|
| O|Biomarker_Result| 96| 103|positive| Oncogene| 137| 140| HER2| 0.9986272|
| O| Biomarker| 109| 110| ER|Biomarker_Result| 124| 131| negative| 0.9999089|
| O| Biomarker| 116| 117| PR|Biomarker_Result| 124| 131| negative| 0.9998932|
|is_finding_of|Biomarker_Result| 124| 131|negative| Oncogene| 137| 140| HER2|0.98810333|
+-------------+----------------+-------------+-----------+--------+----------------+-------------+-----------+--------------------+----------+
```
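In practice you usually keep only the positive (`is_finding_of`) relations above a confidence threshold and discard `O` pairs. A minimal sketch of that post-processing step, with the rows of the table above reduced to plain tuples (the 0.9 threshold is an arbitrary choice for illustration):

```python
# (relation, chunk1, chunk2, confidence) taken from the results table above.
rows = [
    ("is_finding_of", "negative", "thyroid transcription factor-1", 0.99808085),
    ("is_finding_of", "negative", "napsin",                         0.99637383),
    ("is_finding_of", "positive", "ER",                             0.99221414),
    ("is_finding_of", "positive", "PR",                             0.9893672),
    ("O",             "positive", "HER2",                           0.9986272),
    ("is_finding_of", "negative", "HER2",                           0.98810333),
]

# Keep only linked pairs ("O" means the entities are unrelated) above threshold.
linked = [(c1, c2) for rel, c1, c2, conf in rows
          if rel == "is_finding_of" and conf >= 0.9]
print(len(linked))  # 5
```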
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|redl_oncology_biomarker_result_biobert_wip|
|Compatibility:|Healthcare NLP 4.2.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|401.7 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label recall precision f1
O 0.93 0.97 0.95
is_finding_of 0.97 0.93 0.95
macro-avg 0.95 0.95 0.95
```
---
layout: model
title: German RobertaForQuestionAnswering (from Gantenbein)
author: John Snow Labs
name: roberta_qa_ADDI_DE_RoBERTa
date: 2022-06-20
tags: [open_source, question_answering, roberta]
task: Question Answering
language: de
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ADDI-DE-RoBERTa` is a German model originally trained by `Gantenbein`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_DE_RoBERTa_de_4.0.0_3.0_1655726326883.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_DE_RoBERTa_de_4.0.0_3.0_1655726326883.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_ADDI_DE_RoBERTa","de") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_ADDI_DE_RoBERTa","de")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.answer_question.roberta.de_tuned.by_Gantenbein").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_ADDI_DE_RoBERTa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|de|
|Size:|422.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/Gantenbein/ADDI-DE-RoBERTa
---
layout: model
title: Detect Adverse Drug Events (BertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_ner_ade
date: 2022-01-04
tags: [ner, bertfortokenclassification, adverse, ade, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.0
spark_version: 2.4
supported: true
annotator: MedicalBertForTokenClassifier
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Detect adverse reactions of drugs in reviews, tweets, and medical text using the pretrained NER model. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP.
## Predicted Entities
`DRUG`, `ADE`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_en_3.4.0_2.4_1641283944065.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_en_3.4.0_2.4_1641283944065.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_ade", "en", "clinical/models")\
.setInputCols("token", "document")\
.setOutputCol("ner")\
.setCaseSensitive(True)\
.setMaxSentenceLength(512)
ner_converter = NerConverter() \
.setInputCols(["document","token","ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter])
data = spark.createDataFrame([["""Been taking Lipitor for 15 years, have experienced severe fatigue a lot. The doctor moved me to voltaren 2 months ago; so far I have only experienced cramps."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_ade", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("ner")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val ner_converter = new NerConverter()
.setInputCols(Array("document","token","ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, tokenClassifier, ner_converter))
val data = Seq("""Been taking Lipitor for 15 years, have experienced severe fatigue a lot. The doctor moved me to voltaren 2 months ago; so far I have only experienced cramps.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.ner_ade").predict("""Been taking Lipitor for 15 years, have experienced severe fatigue a lot. The doctor moved me to voltaren 2 months ago; so far I have only experienced cramps.""")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|Lipitor |DRUG |
|severe fatigue|ADE |
|voltaren |DRUG |
|cramps |ADE |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_ade|
|Compatibility:|Healthcare NLP 3.4.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## Data Source
This model is trained on a custom dataset by John Snow Labs.
## Benchmarking
```bash
label precision recall f1-score support
B-ADE 0.93 0.79 0.85 2694
B-DRUG 0.97 0.87 0.92 9539
I-ADE 0.93 0.73 0.82 3236
I-DRUG 0.95 0.82 0.88 6115
accuracy - - 0.83 21584
macro-avg 0.84 0.84 0.84 21584
weighted-avg 0.95 0.83 0.89 21584
```
---
layout: model
title: Indonesian RoBERTa Embeddings (Base)
author: John Snow Labs
name: roberta_embeddings_indonesian_roberta_base
date: 2022-04-14
tags: [roberta, embeddings, id, open_source]
task: Embeddings
language: id
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indonesian-roberta-base` is an Indonesian model originally trained by `flax-community`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indonesian_roberta_base_id_3.4.2_3.0_1649948386496.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indonesian_roberta_base_id_3.4.2_3.0_1649948386496.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indonesian_roberta_base","id") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Saya suka percikan NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indonesian_roberta_base","id")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Saya suka percikan NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("id.embed.indonesian_roberta_base").predict("""Saya suka percikan NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_indonesian_roberta_base|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|id|
|Size:|468.3 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/flax-community/indonesian-roberta-base
- https://arxiv.org/abs/1907.11692
- https://hf.co/w11wo
- https://hf.co/stevenlimcorn
- https://hf.co/munggok
- https://hf.co/chewkokwah
---
layout: model
title: Pipeline to Detect clinical entities (ner_healthcare_slim)
author: John Snow Labs
name: ner_healthcare_slim_pipeline
date: 2023-03-15
tags: [ner, clinical, licensed, de]
task: Named Entity Recognition
language: de
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_healthcare_slim](https://nlp.johnsnowlabs.com/2021/04/01/ner_healthcare_slim_de.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_slim_pipeline_de_4.3.0_3.2_1678879973742.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_slim_pipeline_de_4.3.0_3.2_1678879973742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_healthcare_slim_pipeline", "de", "clinical/models")
text = '''Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist Hernia femoralis, Akne, einseitig, ein hochmalignes bronchogenes Karzinom, das überwiegend im Zentrum der Lunge, in einem Hauptbronchus entsteht. Die mittlere Prävalenz wird auf 1/20.000 geschätzt.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_healthcare_slim_pipeline", "de", "clinical/models")
val text = "Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist Hernia femoralis, Akne, einseitig, ein hochmalignes bronchogenes Karzinom, das überwiegend im Zentrum der Lunge, in einem Hauptbronchus entsteht. Die mittlere Prävalenz wird auf 1/20.000 geschätzt."
val result = pipeline.fullAnnotate(text)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","hr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Volim iskru nlp"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","hr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Volim iskru nlp").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("hr.embed.w2v_cc_300d").predict("""Volim iskru nlp""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|hr|
|Size:|1.2 GB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Chinese BertForMaskedLM Cased model (from ptrsxu)
author: John Snow Labs
name: bert_embeddings_ptrsxu_chinese_wwm_ext
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-bert-wwm-ext` is a Chinese model originally trained by `ptrsxu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_ptrsxu_chinese_wwm_ext_zh_4.2.4_3.0_1670020981050.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_ptrsxu_chinese_wwm_ext_zh_4.2.4_3.0_1670020981050.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_ptrsxu_chinese_wwm_ext","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_ptrsxu_chinese_wwm_ext","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_ptrsxu_chinese_wwm_ext|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|383.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/ptrsxu/chinese-bert-wwm-ext
- https://arxiv.org/abs/1906.08101
- https://github.com/google-research/bert
- https://github.com/ymcui/Chinese-BERT-wwm
- https://github.com/ymcui/MacBERT
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/ymcui/HFL-Anthology
- https://arxiv.org/abs/2004.13922
- https://arxiv.org/abs/1906.08101
---
layout: model
title: Smaller BERT Embeddings (L-10_H-256_A-4)
author: John Snow Labs
name: small_bert_L10_256
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
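As a minimal illustration of the distillation idea mentioned above (this sketch is not part of Spark NLP; all names and values are hypothetical), the student is trained to match the teacher's temperature-softened output distribution:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's soft targets,
    the core term of knowledge distillation."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# The loss is smallest when the student reproduces the teacher's distribution.
loss = distillation_loss([1.0, 0.5, -0.2], [2.0, 0.1, -1.0])
```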
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L10_256_en_2.6.0_2.4_1598344485022.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L10_256_en_2.6.0_2.4_1598344485022.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("small_bert_L10_256", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("small_bert_L10_256", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.bert.small_L10_256').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_bert_small_L10_256_embeddings
I [0.14484411478042603, -0.8349236249923706, -1....
love [-0.7449802160263062, -0.4852253794670105, -0....
NLP [-0.03900821506977081, -0.044783130288124084, ...
```
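The vectors above are truncated for display. To show how such token embeddings are typically compared, here is a small cosine-similarity sketch using hypothetical low-dimensional vectors in place of the real 256-dimensional ones:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional stand-ins for the vectors in the table above.
v_love = [0.8, -0.2, 0.4, 0.1]
v_like = [0.7, -0.1, 0.5, 0.2]
v_rock = [-0.3, 0.9, -0.5, 0.0]

# Semantically close tokens should score higher than unrelated ones.
print(cosine_similarity(v_love, v_like))
print(cosine_similarity(v_love, v_rock))
```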
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|small_bert_L10_256|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|en|
|Dimension:|256|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1
---
layout: model
title: Bert Embeddings Romanian (Base Cased)
author: John Snow Labs
name: bert_base_cased
date: 2021-09-13
tags: [open_source, embeddings, ro]
task: Embeddings
language: ro
edition: Spark NLP 3.2.0
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model contains a deep bidirectional transformer trained on Romanian-language Wikipedia and BookCorpus text. The details are described in the paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_cased_ro_3.2.0_3.0_1631533635237.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_cased_ro_3.2.0_3.0_1631533635237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
Generates 768 dimensional embeddings vectors per token
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_base_cased|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ro|
|Case sensitive:|true|
## Data Source
```bash
This model is imported from https://huggingface.co/dumitrescustefan/bert-base-romanian-cased-v1
```
---
layout: model
title: English DistilBertForQuestionAnswering Base Cased model (from monakth)
author: John Snow Labs
name: distilbert_qa_monakth_base_cased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-finetuned-squad` is an English model originally trained by `monakth`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_monakth_base_cased_finetuned_squad_en_4.3.0_3.0_1672766954155.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_monakth_base_cased_finetuned_squad_en_4.3.0_3.0_1672766954155.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_monakth_base_cased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_monakth_base_cased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_monakth_base_cased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/monakth/distilbert-base-cased-finetuned-squad
---
layout: model
title: English XlmRoBertaForQuestionAnswering (from airesearch)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_base_finetune_qa
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetune-qa` is an English model originally trained by `airesearch`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_finetune_qa_en_4.0.0_3.0_1655989494932.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_finetune_qa_en_4.0.0_3.0_1655989494932.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_finetune_qa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlm_roberta_base_finetune_qa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xlm_roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_base_finetune_qa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|864.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/airesearch/xlm-roberta-base-finetune-qa
- https://wandb.ai/cstorm125/wangchanberta-qa
- https://github.com/vistec-AI/thai2transformers/blob/dev/scripts/downstream/train_question_answering_lm_finetuning.py
---
layout: model
title: Detect Cellular/Molecular Biology Entities (BertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_ner_cellular
date: 2021-11-03
tags: [bertfortokenclassification, ner, cellular, en, clinical, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.3.0
spark_version: 2.4
supported: true
annotator: MedicalBertForTokenClassifier
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model detects molecular biology-related terms in medical texts. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP.
## Predicted Entities
`DNA`, `Cell_type`, `Cell_line`, `RNA`, `Protein`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_cellular_en_3.3.0_2.4_1635938889847.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_cellular_en_3.3.0_2.4_1635938889847.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
import pandas as pd
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_cellular", "en", "clinical/models")\
.setInputCols("token", "document")\
.setOutputCol("ner")\
.setCaseSensitive(True)
ner_converter = NerConverter()\
.setInputCols(["document","token","ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, sentence_detector, tokenizer, tokenClassifier, ner_converter])
p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))
test_sentence = """Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive."""
result = p_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]})))
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_cellular", "en", "clinical/models")
.setInputCols(Array("token", "document"))
.setOutputCol("ner")
.setCaseSensitive(true)
val ner_converter = new NerConverter()
.setInputCols(Array("document","token","ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence_detector, tokenizer, tokenClassifier, ner_converter))
val data = Seq("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.cellular").predict("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""")
```
## Results
```bash
+-------------------------------------------+---------+
|chunk |ner_label|
+-------------------------------------------+---------+
|intracellular signaling proteins |protein |
|human T-cell leukemia virus type 1 promoter|DNA |
|Tax |protein |
|Tax-responsive element 1 |DNA |
|cyclic AMP-responsive members |protein |
|CREB/ATF family |protein |
|transcription factors |protein |
|Tax |protein |
|human T-cell leukemia virus type 1 |DNA |
|Tax-responsive element 1 |DNA |
|TRE-1 |DNA |
|lacZ gene |DNA |
|CYC1 promoter |DNA |
|TRE-1 |DNA |
|cyclic AMP response element-binding protein|protein |
|CREB |protein |
|CREB |protein |
|GAL4 activation domain |protein |
|GAD |protein |
|reporter gene |DNA |
|Tax |protein |
+-------------------------------------------+---------+
```
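The chunk/label pairs above come from the `NerConverter` stage, which merges the token-level B-/I- tags emitted by the classifier into entity chunks. A minimal pure-Python sketch of that merging logic (illustrative only, not the Spark NLP implementation):

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO tags into (chunk_text, label) pairs."""
    chunks, current_tokens, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag closes any open chunk and starts a new one.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)  # continue the open chunk
        else:  # "O", or an I- tag that does not continue the open chunk
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["Binding", "of", "Tax", "to", "Tax-responsive", "element", "1"]
tags   = ["O", "O", "B-protein", "O", "B-DNA", "I-DNA", "I-DNA"]
print(bio_to_chunks(tokens, tags))
# [('Tax', 'protein'), ('Tax-responsive element 1', 'DNA')]
```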
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_cellular|
|Compatibility:|Healthcare NLP 3.3.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Case sensitive:|true|
|Max sentence length:|512|
## Data Source
Trained on the JNLPBA corpus, containing 2,404 publication abstracts. http://www.geniaproject.org/
## Benchmarking
```bash
label precision recall f1-score support
B-DNA 0.87 0.77 0.82 1056
B-RNA 0.85 0.79 0.82 118
B-cell_line 0.66 0.70 0.68 500
B-cell_type 0.87 0.75 0.81 1921
B-protein 0.90 0.85 0.88 5067
I-DNA 0.93 0.86 0.90 1789
I-RNA 0.92 0.84 0.88 187
I-cell_line 0.67 0.76 0.71 989
I-cell_type 0.92 0.76 0.84 2991
I-protein 0.94 0.80 0.87 4774
accuracy - - 0.80 19392
macro-avg 0.76 0.81 0.78 19392
weighted-avg 0.89 0.80 0.85 19392
```
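Each f1-score in the table above is the harmonic mean of that row's precision and recall, which can be checked directly (last digits may occasionally differ because precision and recall are themselves rounded):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproducing the B-DNA and B-RNA rows of the table above.
print(round(f1_score(0.87, 0.77), 2))  # 0.82
print(round(f1_score(0.85, 0.79), 2))  # 0.82
```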
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from shwetha)
author: John Snow Labs
name: distilbert_qa_autotrain_user_954831770
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-qa-user-954831770` is an English model originally trained by `shwetha`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_autotrain_user_954831770_en_4.3.0_3.0_1672765643101.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_autotrain_user_954831770_en_4.3.0_3.0_1672765643101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_autotrain_user_954831770","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_autotrain_user_954831770","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_autotrain_user_954831770|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/shwetha/autotrain-qa-user-954831770
---
layout: model
title: Swedish asr_lm_swedish TFWav2Vec2ForCTC from birgermoell
author: John Snow Labs
name: pipeline_asr_lm_swedish
date: 2022-09-25
tags: [wav2vec2, sv, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: sv
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_lm_swedish` is a Swedish model originally trained by birgermoell.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_lm_swedish_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_lm_swedish_sv_4.2.0_3.0_1664117937565.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_lm_swedish_sv_4.2.0_3.0_1664117937565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('pipeline_asr_lm_swedish', lang = 'sv')
annotations = pipeline.transform(audioDF)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("pipeline_asr_lm_swedish", lang = "sv")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_lm_swedish|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|sv|
|Size:|757.4 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_nl24
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nl24` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl24_en_4.3.0_3.0_1675123867646.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl24_en_4.3.0_3.0_1675123867646.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_tiny_nl24","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_tiny_nl24","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_nl24|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|117.0 MB|
## References
- https://huggingface.co/google/t5-efficient-tiny-nl24
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: French CamemBert Embeddings (from Leisa)
author: John Snow Labs
name: camembert_embeddings_Leisa_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `Leisa`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Leisa_generic_model_fr_3.4.4_3.0_1653986639094.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Leisa_generic_model_fr_3.4.4_3.0_1653986639094.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Leisa_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Leisa_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
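The `embeddings` column holds one vector per token. Downstream, token vectors are often compared with cosine similarity; a minimal sketch with assumed toy 3-dimensional vectors (the real CamemBERT vectors are 768-dimensional):

```python
import math

# Toy 3-dimensional vectors standing in for two token embeddings;
# the actual CamemBERT output vectors are 768-dimensional.
a = [0.1, 0.3, -0.2]
b = [0.2, 0.1, -0.1]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

print(round(cosine(a, b), 3))  # 0.764
```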
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_Leisa_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Leisa/dummy-model
---
layout: model
title: Fast and Accurate Language Identification - 231 Languages (CNN)
author: John Snow Labs
name: ld_wiki_cnn_231
date: 2020-12-05
task: Language Detection
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [language_detection, open_source, xx]
supported: true
annotator: LanguageDetectorDL
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Language detection and identification is the task of automatically detecting the language(s) present in a document based on its content. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language in documents with mixed languages by coalescing sentences and selecting the best candidate.
We have designed and developed deep learning models using CNNs in TensorFlow/Keras. The model is trained on a Wikipedia dataset and achieves high accuracy when evaluated on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias).
This model can detect the following languages:
`Achinese`, `Afrikaans`, `Tosk Albanian`, `Amharic`, `Aragonese`, `Old English`, `Arabic`, `Egyptian Arabic`, `Assamese`, `Asturian`, `Avaric`, `Aymara`, `Azerbaijani`, `South Azerbaijani`, `Bashkir`, `Bavarian`, `bat-smg`, `Central Bikol`, `Belarusian`, `Bulgarian`, `bh`, `Banjar`, `Bengali`, `Tibetan`, `Bishnupriya`, `Breton`, `Bosnian`, `Russia Buriat`, `Catalan`, `cbk-zam`, `Min Dong Chinese`, `Chechen`, `Cebuano`, `Central Kurdish (Soranî)`, `Corsican`, `Crimean Tatar`, `Czech`, `Kashubian`, `Chuvash`, `Welsh`, `Danish`, `German`, `Dimli (individual language)`, `Lower Sorbian`, `dty`, `Dhivehi`, `Greek`, `eml`, `English`, `Esperanto`, `Spanish`, `Estonian`, `Extremaduran`, `Persian`, `Finnish`, `fiu-vro`, `Faroese`, `French`, `Arpitan`, `Friulian`, `Frisian`, `Irish`, `Gagauz`, `Scottish Gaelic`, `Galician`, `Gilaki`, `Guarani`, `Konkani (Goan)`, `Gujarati`, `Manx`, `Hausa`, `Hakka Chinese`, `Hebrew`, `Hindi`, `Fiji Hindi`, `Croatian`, `Upper Sorbian`, `Haitian Creole`, `Hungarian`, `Armenian`, `Interlingua`, `Indonesian`, `Interlingue`, `Igbo`, `Ilocano`, `Ido`, `Icelandic`, `Italian`, `Japanese`, `Jamaican Patois`, `Lojban`, `Javanese`, `Georgian`, `Karakalpak`, `Kabyle`, `Kabardian`, `Kazakh`, `Khmer`, `Kannada`, `Korean`, `Komi-Permyak`, `Karachay-Balkar`, `Kölsch`, `Kurdish`, `Komi`, `Cornish`, `Kyrgyz`, `Latin`, `Ladino`, `Luxembourgish`, `Lezghian`, `Luganda`, `Limburgan`, `Ligurian`, `Lombard`, `Lingala`, `Lao`, `Northern Luri`, `Lithuanian`, `Latgalian`, `Latvian`, `Maithili`, `map-bms`, `Moksha`, `Malagasy`, `Meadow Mari`, `Maori`, `Minangkabau`, `Macedonian`, `Malayalam`, `Mongolian`, `Marathi`, `Hill Mari`, `Malay`, `Maltese`, `Mirandese`, `Burmese`, `Erzya`, `Mazanderani`, `Nahuatl`, `Neapolitan`, `Low German (Low Saxon)`, `nds-nl`, `Nepali`, `Newari`, `Dutch`, `Norwegian Nynorsk`, `Norwegian`, `Narom`, `Pedi`, `Navajo`, `Occitan`, `Livvi`, `Oromo`, `Odia (Oriya)`, `Ossetian`, `Punjabi (Eastern)`, `Pangasinan`, `Kapampangan`, `Papiamento`, 
`Picard`, `Pennsylvania German`, `Palatine German`, `Polish`, `Punjabi (Western)`, `Pashto`, `Portuguese`, `Quechua`, `Romansh`, `Romanian`, `roa-rup`, `roa-tara`, `Russian`, `Rusyn`, `Kinyarwanda`, `Sanskrit`, `Yakut`, `Sardinian`, `Sicilian`, `Sindhi`, `Northern Sami`, `Serbo-Croatian`, `Sinhala`, `Slovak`, `Slovenian`, `Shona`, `Somali`, `Albanian`, `Serbian`, `Sranan Tongo`, `Saterland Frisian`, `Sundanese`, `Swedish`, `Swahili`, `Silesian`, `Tamil`, `Tulu`, `Telugu`, `Tetun`, `Tajik`, `Thai`, `Turkmen`, `Tagalog`, `Setswana`, `Tongan`, `Turkish`, `Tatar`, `Tuvinian`, `Udmurt`, `Uyghur`, `Ukrainian`, `Urdu`, `Uzbek`, `Venetian`, `Veps`, `Vietnamese`, `Vlaams`, `Volapük`, `Walloon`, `Waray`, `Wolof`, `Shanghainese`, `Xhosa`, `Mingrelian`, `Yiddish`, `Yoruba`, `Zeeuws`, `Chinese`, `zh-classical`, `zh-min-nan`, `zh-yue`.
## Predicted Entities
`ace`, `af`, `als`, `am`, `an`, `ang`, `ar`, `arz`, `as`, `ast`, `av`, `ay`, `az`, `azb`, `ba`, `bar`, `bat-smg`, `bcl`, `be`, `bg`, `bh`, `bjn`, `bn`, `bo`, `bpy`, `br`, `bs`, `bxr`, `ca`, `cbk-zam`, `cdo`, `ce`, `ceb`, `ckb`, `co`, `crh`, `cs`, `csb`, `cv`, `cy`, `da`, `de`, `diq`, `dsb`, `dty`, `dv`, `el`, `eml`, `en`, `eo`, `es`, `et`, `ext`, `fa`, `fi`, `fiu-vro`, `fo`, `fr`, `frp`, `fur`, `fy`, `ga`, `gag`, `gd`, `gl`, `glk`, `gn`, `gom`, `gu`, `gv`, `ha`, `hak`, `he`, `hi`, `hif`, `hr`, `hsb`, `ht`, `hu`, `hy`, `ia`, `id`, `ie`, `ig`, `ilo`, `io`, `is`, `it`, `ja`, `jam`, `jbo`, `jv`, `ka`, `kaa`, `kab`, `kbd`, `kk`, `km`, `kn`, `ko`, `koi`, `krc`, `ksh`, `ku`, `kv`, `kw`, `ky`, `la`, `lad`, `lb`, `lez`, `lg`, `li`, `lij`, `lmo`, `ln`, `lo`, `lrc`, `lt`, `ltg`, `lv`, `mai`, `map-bms`, `mdf`, `mg`, `mhr`, `mi`, `min`, `mk`, `ml`, `mn`, `mr`, `mrj`, `ms`, `mt`, `mwl`, `my`, `myv`, `mzn`, `nah`, `nap`, `nds`, `nds-nl`, `ne`, `new`, `nl`, `nn`, `no`, `nrm`, `nso`, `nv`, `oc`, `olo`, `om`, `or`, `os`, `pa`, `pag`, `pam`, `pap`, `pcd`, `pdc`, `pfl`, `pl`, `pnb`, `ps`, `pt`, `qu`, `rm`, `ro`, `roa-rup`, `roa-tara`, `ru`, `rue`, `rw`, `sa`, `sah`, `sc`, `scn`, `sd`, `se`, `sh`, `si`, `sk`, `sl`, `sn`, `so`, `sq`, `sr`, `srn`, `stq`, `su`, `sv`, `sw`, `szl`, `ta`, `tcy`, `te`, `tet`, `tg`, `th`, `tk`, `tl`, `tn`, `to`, `tr`, `tt`, `tyv`, `udm`, `ug`, `uk`, `ur`, `uz`, `vec`, `vep`, `vi`, `vls`, `vo`, `wa`, `war`, `wo`, `wuu`, `xh`, `xmf`, `yi`, `yo`, `zea`, `zh`, `zh-classical`, `zh-min-nan`, `zh-yue`.
{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ld_wiki_cnn_231_xx_2.7.0_2.4_1607183625658.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ld_wiki_cnn_231_xx_2.7.0_2.4_1607183625658.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
language_detector = LanguageDetectorDL.pretrained("ld_wiki_cnn_231", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("language")
languagePipeline = Pipeline(stages=[documentAssembler, sentenceDetector, language_detector])
light_pipeline = LightPipeline(languagePipeline.fit(spark.createDataFrame([['']]).toDF("text")))
result = light_pipeline.fullAnnotate("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.")
```
```scala
...
val languageDetector = LanguageDetectorDL.pretrained("ld_wiki_cnn_231", "xx")
.setInputCols("sentence")
.setOutputCol("language")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, languageDetector))
val data = Seq("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."]
lang_df = nlu.load('xx.classify.wiki_231').predict(text, output_level='sentence')
lang_df
```
## Results
```bash
'fr'
```
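The prediction is a Wiki-style language code. As an illustration of mapping these codes to language names downstream, here is a minimal sketch (the subset below is an assumption chosen for demonstration; the full 231-code list is given under "Predicted Entities"):

```python
# Illustrative only: a few of the model's Wiki-style output codes mapped
# to language names. This small dict is an assumption for demonstration.
WIKI_CODE_NAMES = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "sv": "Swedish",
}

def code_to_name(code: str) -> str:
    """Return a readable name for a Wiki language code, or the code itself."""
    return WIKI_CODE_NAMES.get(code, code)

print(code_to_name("fr"))  # French
```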
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ld_wiki_cnn_231|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[language]|
|Language:|xx|
## Data Source
Wikipedia
## Benchmarking
```bash
Evaluated on the Europarl dataset, which the model has never seen:
+--------+-----+-------+------------------+
|src_lang|count|correct| precision|
+--------+-----+-------+------------------+
| fr| 1000| 996| 0.996|
| fi| 1000| 995| 0.995|
| sv| 1000| 994| 0.994|
| en| 1000| 991| 0.991|
| pt| 1000| 988| 0.988|
| de| 1000| 986| 0.986|
| it| 1000| 982| 0.982|
| es| 1000| 977| 0.977|
| nl| 1000| 974| 0.974|
| lt| 1000| 969| 0.969|
| hu| 880| 850|0.9659090909090909|
| lv| 916| 884|0.9650655021834061|
| el| 1000| 965| 0.965|
| pl| 914| 882|0.9649890590809628|
| cs| 1000| 964| 0.964|
| da| 1000| 963| 0.963|
| et| 928| 892|0.9612068965517241|
| bg| 1000| 954| 0.954|
| sk| 1000| 945| 0.945|
| ro| 784| 738|0.9413265306122449|
| sl| 914| 850|0.9299781181619255|
+--------+-----+-------+------------------+
+-------+--------------------+
|summary| precision|
+-------+--------------------+
| count| 21|
| mean| 0.9700702474999693|
| stddev|0.018256955176991118|
| min| 0.9299781181619255|
| max| 0.996|
+-------+--------------------+
```
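The per-language precision above is simply `correct / count`, and the summary row can be reproduced from the table; a quick sketch using Python's `statistics` module:

```python
from statistics import mean, stdev

# (count, correct) pairs transcribed from the benchmarking table above.
counts = {
    "fr": (1000, 996), "fi": (1000, 995), "sv": (1000, 994), "en": (1000, 991),
    "pt": (1000, 988), "de": (1000, 986), "it": (1000, 982), "es": (1000, 977),
    "nl": (1000, 974), "lt": (1000, 969), "hu": (880, 850), "lv": (916, 884),
    "el": (1000, 965), "pl": (914, 882), "cs": (1000, 964), "da": (1000, 963),
    "et": (928, 892), "bg": (1000, 954), "sk": (1000, 945), "ro": (784, 738),
    "sl": (914, 850),
}

# Per-language precision, then the summary statistics from the second table.
precision = {lang: correct / count for lang, (count, correct) in counts.items()}
print(round(mean(precision.values()), 4))  # 0.9701
print(round(min(precision.values()), 4))   # 0.93 (sl)
```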
---
layout: model
title: English image_classifier_vit_upside_down_classifier ViTForImageClassification from daveni
author: John Snow Labs
name: image_classifier_vit_upside_down_classifier
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_upside_down_classifier` is an English model originally trained by daveni.
## Predicted Entities
`original`, `upside_down`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_upside_down_classifier_en_4.1.0_3.0_1660166128533.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_upside_down_classifier_en_4.1.0_3.0_1660166128533.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_upside_down_classifier", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_upside_down_classifier", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_upside_down_classifier|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Telugu BertForMaskedLM Cased model (from neuralspace-reverie)
author: John Snow Labs
name: bert_embeddings_indic_transformers
date: 2022-12-02
tags: [te, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: te
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-te-bert` is a Telugu model originally trained by `neuralspace-reverie`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_te_4.2.4_3.0_1670022427927.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_te_4.2.4_3.0_1670022427927.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","te") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","te")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_indic_transformers|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|te|
|Size:|611.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/neuralspace-reverie/indic-transformers-te-bert
- https://oscar-corpus.com/
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from skandaonsolve)
author: John Snow Labs
name: roberta_qa_finetuned_facility
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-finetuned-facility` is an English model originally trained by `skandaonsolve`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_facility_en_4.3.0_3.0_1674220319873.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_facility_en_4.3.0_3.0_1674220319873.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_facility","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_facility","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
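The annotator returns the answer as a span extracted from the context. A toy illustration of that output format for the example above (the span indices are assumed for demonstration and are not how RoBERTa computes the answer):

```python
# Toy illustration of extractive QA output -- not the model's mechanism.
context = "My name is Clara and I live in Berkeley."

# The model effectively selects a (start, end) character span over the
# context; here we assume it picked the span covering "Clara".
start, end = 11, 16
answer = context[start:end]
print(answer)  # Clara
```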
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_finetuned_facility|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/skandaonsolve/roberta-finetuned-facility
---
layout: model
title: Named Entity Recognition Profiling (Clinical)
author: John Snow Labs
name: ner_profiling_clinical
date: 2023-05-04
tags: [licensed, en, clinical, profiling, ner_profiling, ner]
task: [Named Entity Recognition, Pipeline Healthcare]
language: en
edition: Healthcare NLP 4.4.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline can be used to explore all the available pretrained NER models at once. When you run it over your text, you get the predictions of every pretrained clinical NER model trained with `embeddings_clinical`. This version adds new clinical NER models, and their outputs, to the previous release.
Here are the NER models that this pretrained pipeline includes:
`jsl_ner_wip_clinical`,`jsl_ner_wip_greedy_clinical`,`jsl_ner_wip_modifier_clinical`, `jsl_rd_ner_wip_greedy_clinical`, `ner_abbreviation_clinical`, `ner_ade_binary`, `ner_ade_clinical`, `ner_anatomy`, `ner_anatomy_coarse`, `ner_bacterial_species`, `ner_biomarker`, `ner_biomedical_bc2gm`, `ner_bionlp`, `ner_cancer_genetics`, `ner_cellular`, `ner_chemd_clinical`, `ner_chemicals`, `ner_chemprot_clinical`, `ner_chexpert`, `ner_clinical`, `ner_clinical_large`, `ner_clinical_trials_abstracts`, `ner_covid_trials`, `ner_deid_augmented`, `ner_deid_enriched`, `ner_deid_generic_augmented`,`ner_deid_large`, `ner_deid_sd`,`ner_deid_sd_large`,`ner_deid_subentity_augmented`,`ner_deid_subentity_augmented_i2b2`, `ner_deid_synthetic`, `ner_diseases`, `ner_diseases_large`, `ner_drugprot_clinical`, `ner_drugs`, `ner_drugs_greedy`, `ner_drugs_large`, `ner_eu_clinical_case`, `ner_eu_clinical_condition`, `ner_events_admission_clinical`, `ner_events_clinical`, `ner_financial_contract`, `ner_genetic_variants`, `ner_healthcare`, `ner_human_phenotype_gene_clinical`, `ner_human_phenotype_go_clinical`, `ner_jsl`, `ner_jsl_enriched`, `ner_jsl_slim`, `ner_living_species`, `ner_measurements_clinical`, `ner_medmentions_coarse`, `ner_nature_nero_clinical`, `ner_nihss`, `ner_oncology`, `ner_oncology_anatomy_general`, `ner_oncology_anatomy_granular`, `ner_oncology_biomarker`, `ner_oncology_demographics`, `ner_oncology_diagnosis`, `ner_oncology_posology`, `ner_oncology_response_to_treatment`, `ner_oncology_test`, `ner_oncology_therapy`, `ner_oncology_tnm`, `ner_oncology_unspecific_posology`, `ner_oncology_wip`, `ner_pathogen`, `ner_posology`, `ner_posology_experimental`, `ner_posology_greedy`, `ner_posology_large`, `ner_posology_small`, `ner_radiology`, `ner_radiology_wip_clinical`, `ner_risk_factors`, `ner_sdoh_access_to_healthcare_wip`, `ner_sdoh_community_condition_wip`, `ner_sdoh_demographics_wip`, `ner_sdoh_health_behaviours_problems_wip`, `ner_sdoh_income_social_status_wip`, 
`ner_sdoh_mentions`, `ner_sdoh_slim_wip`, `ner_sdoh_social_environment_wip`, `ner_sdoh_substance_usage_wip`, `ner_sdoh_wip`, `ner_supplement_clinical`, `ner_vop_anatomy_wip`, `ner_vop_clinical_dept_wip`, `ner_vop_demographic_wip`, `ner_vop_problem_reduced_wip`, `ner_vop_problem_wip`, `ner_vop_slim_wip`, `ner_vop_temporal_wip`, `ner_vop_test_wip`, `ner_vop_treatment_wip`, `ner_vop_wip`, `nerdl_tumour_demo`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_profiling_clinical_en_4.4.0_3.2_1683225723531.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_profiling_clinical_en_4.4.0_3.2_1683225723531.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
ner_profiling_pipeline = PretrainedPipeline("ner_profiling_clinical", "en", "clinical/models")
result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_clinical", "en", "clinical/models")
val result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.profiling_clinical").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = WordEmbeddingsModel.pretrained('glove_6B_300', lang='xx') \
.setInputCols(['document', 'token']) \
.setOutputCol('embeddings')
ner_model = NerDLModel.pretrained("norne_6B_300", "no") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text'))
result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. [ 9] Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella.']], ["text"]))
```
```scala
...
val embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("norne_6B_300", "no")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. [ 9] Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella."""]
ner_df = nlu.load('no.ner.norne.glove.6B_300').predict(text, output_level = "chunk")
ner_df[["entities", "entities_confidence"]]
```
{:.h2_title}
## Results
```bash
+-------------------------------+---------+
|chunk |ner_label|
+-------------------------------+---------+
|William Henry Gates III |PER |
|Microsoft Corporation |ORG |
|Microsoft |ORG |
|Gates |PER |
|CEO |ORG |
|Seattle |GPE_LOC |
|Washington |GPE_LOC |
|Microsoft |ORG |
|Paul Allen |PER |
|Albuquerque |GPE_LOC |
|New Mexico |GPE_LOC |
|Gates |PER |
|Gates |PER |
|Gates |PER |
|Microsoft |ORG |
|Bill & Melinda Gates Foundation|ORG |
|Melinda Gates |PER |
|Ray Ozzie |PER |
|Craig Mundie |PER |
|Microsoft |ORG |
+-------------------------------+---------+
```
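Downstream, the chunk/label pairs above are easy to aggregate; a minimal sketch using `collections.Counter` (the label list is transcribed from the results table):

```python
from collections import Counter

# ner_label column transcribed, in order, from the results table above.
labels = [
    "PER", "ORG", "ORG", "PER", "ORG", "GPE_LOC", "GPE_LOC", "ORG",
    "PER", "GPE_LOC", "GPE_LOC", "PER", "PER", "PER", "ORG", "ORG",
    "PER", "PER", "PER", "ORG",
]

counts = Counter(labels)
print(counts)  # Counter({'PER': 9, 'ORG': 7, 'GPE_LOC': 4})
```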
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|norne_6B_300|
|Type:|ner|
|Compatibility:| Spark NLP 2.5.0+|
|Edition:|Official|
|License:|Open Source|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|no|
|Case sensitive:|false|
{:.h2_title}
## Data Source
Detailed information can be found in [https://www.aclweb.org/anthology/2020.lrec-1.559.pdf](https://www.aclweb.org/anthology/2020.lrec-1.559.pdf)
---
layout: model
title: Clinical Deidentification (Italian)
author: John Snow Labs
name: clinical_deidentification
date: 2023-06-13
tags: [deidentification, pipeline, it, licensed]
task: De-identification
language: it
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline can be used to deidentify PHI information from medical texts in Italian. The pipeline can mask and obfuscate the following entities: `DATE`, `AGE`, `SEX`, `PROFESSION`, `ORGANIZATION`, `PHONE`, `E-MAIL`, `ZIP`, `STREET`, `CITY`, `COUNTRY`, `PATIENT`, `DOCTOR`, `HOSPITAL`, `MEDICALRECORD`, `SSN`, `IDNUM`, `ACCOUNT`, `PLATE`, `USERNAME`, `URL`, and `IPADDR`.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_it_4.4.4_3.2_1686664266856.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_it_4.4.4_3.2_1686664266856.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("clinical_deidentification", "it", "clinical/models")
sample = """RAPPORTO DI RICOVERO
NOME: Lodovico Fibonacci
CODICE FISCALE: MVANSK92F09W408A
INDIRIZZO: Viale Burcardo 7
CITTÀ : Napoli
CODICE POSTALE: 80139
DATA DI NASCITA: 03/03/1946
ETÀ: 70 anni
SESSO: M
EMAIL: lpizzo@tim.it
DATA DI AMMISSIONE: 12/12/2016
DOTTORE: Eva Viviani
RAPPORTO CLINICO: 70 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.
È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.
L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.
L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale.
L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml.
INDIRIZZATO A: Dott. Bruno Ferrabosco - ASL Napoli 1 Centro, Dipartimento di Endocrinologia e Nutrizione - Stretto Scamarcio 320, 80138 Napoli
EMAIL: bferrabosco@poste.it"""
result = deid_pipeline.annotate(sample)
```
```scala
val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "it", "clinical/models")
val sample = """RAPPORTO DI RICOVERO
NOME: Lodovico Fibonacci
CODICE FISCALE: MVANSK92F09W408A
INDIRIZZO: Viale Burcardo 7
CITTÀ : Napoli
CODICE POSTALE: 80139
DATA DI NASCITA: 03/03/1946
ETÀ: 70 anni
SESSO: M
EMAIL: lpizzo@tim.it
DATA DI AMMISSIONE: 12/12/2016
DOTTORE: Eva Viviani
RAPPORTO CLINICO: 70 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.
È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.
L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.
L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale.
L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml.
INDIRIZZATO A: Dott. Bruno Ferrabosco - ASL Napoli 1 Centro, Dipartimento di Endocrinologia e Nutrizione - Stretto Scamarcio 320, 80138 Napoli
EMAIL: bferrabosco@poste.it"""
val result = deid_pipeline.annotate(sample)
```
{:.nlu-block}
```python
import nlu
nlu.load("it.deid.clinical").predict("""RAPPORTO DI RICOVERO
NOME: Lodovico Fibonacci
CODICE FISCALE: MVANSK92F09W408A
INDIRIZZO: Viale Burcardo 7
CITTÀ : Napoli
CODICE POSTALE: 80139
DATA DI NASCITA: 03/03/1946
ETÀ: 70 anni
SESSO: M
EMAIL: lpizzo@tim.it
DATA DI AMMISSIONE: 12/12/2016
DOTTORE: Eva Viviani
RAPPORTO CLINICO: 70 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.
È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.
L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.
L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale.
L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml.
INDIRIZZATO A: Dott. Bruno Ferrabosco - ASL Napoli 1 Centro, Dipartimento di Endocrinologia e Nutrizione - Stretto Scamarcio 320, 80138 Napoli
EMAIL: bferrabosco@poste.it""")
```
## Results
```bash
Results
Masked with entity labels
------------------------------
RAPPORTO DI RICOVERO
NOME:
CODICE FISCALE:
INDIRIZZO:
CITTÀ :
CODICE POSTALE:
DATA DI NASCITA:
ETÀ: anni
SESSO:
EMAIL:
DATA DI AMMISSIONE:
DOTTORE:
RAPPORTO CLINICO: anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.
È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.
L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.
L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale.
L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali.
PSA di 1,16 ng/ml.
INDIRIZZATO A: Dott.
- , Dipartimento di Endocrinologia e Nutrizione - ,
EMAIL:
Masked with chars
------------------------------
RAPPORTO DI RICOVERO
NOME: [****************]
CODICE FISCALE: [**************]
INDIRIZZO: [**************]
CITTÀ : [****]
CODICE POSTALE: [***]DATA DI NASCITA: [********]
ETÀ: **anni
SESSO: *
EMAIL: [***********]
DATA DI AMMISSIONE: [********]
DOTTORE: [*********]
RAPPORTO CLINICO: **anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.
È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.
L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.
L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale.
L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali.
PSA di 1,16 ng/ml.
INDIRIZZATO A: Dott.
[**************] - [*****************], Dipartimento di Endocrinologia e Nutrizione - [*******************], [***] [****]
EMAIL: [******************]
Masked with fixed length chars
------------------------------
RAPPORTO DI RICOVERO
NOME: ****
CODICE FISCALE: ****
INDIRIZZO: ****
CITTÀ : ****
CODICE POSTALE: ****DATA DI NASCITA: ****
ETÀ: **** anni
SESSO: ****
EMAIL: ****
DATA DI AMMISSIONE: ****
DOTTORE: ****
RAPPORTO CLINICO: **** anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.
È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.
L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.
L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale.
L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali.
PSA di 1,16 ng/ml.
INDIRIZZATO A: Dott.
**** - ****, Dipartimento di Endocrinologia e Nutrizione - ****, **** ****
EMAIL: ****
Obfuscated
------------------------------
RAPPORTO DI RICOVERO
NOME: Scotto-Polani
CODICE FISCALE: ECI-QLN77G15L455Y
INDIRIZZO: Viale Orlando 808
CITTÀ : Sesto Raimondo
CODICE POSTALE: 53581DATA DI NASCITA: 09/03/1946
ETÀ: 5 anni
SESSO: U
EMAIL: HenryWatson@world.com
DATA DI AMMISSIONE: 10/01/2017
DOTTORE: Sig. Fredo Marangoni
RAPPORTO CLINICO: 5 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.
È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.
L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.
L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale.
L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali.
PSA di 1,16 ng/ml.
INDIRIZZATO A: Dott.
Antonio Rusticucci - ASL 7 DI CARBONIA AZIENDA U.S.L. N. 7, Dipartimento di Endocrinologia e Nutrizione - Via Giorgio 0 Appartamento 26, 03461 Sesto Raimondo
EMAIL: murat.g@jsl.com
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|clinical_deidentification|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|it|
|Size:|1.3 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- ContextualParserModel
- ContextualParserModel
- RegexMatcherModel
- RegexMatcherModel
- RegexMatcherModel
- RegexMatcherModel
- RegexMatcherModel
- RegexMatcherModel
- RegexMatcherModel
- RegexMatcherModel
- RegexMatcherModel
- RegexMatcherModel
- ChunkMergeModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- Finisher
---
layout: model
title: Pipeline to Mapping SNOMED Codes with Their Corresponding ICD10-CM Codes
author: John Snow Labs
name: snomed_icd10cm_mapping
date: 2022-06-27
tags: [pipeline, snomed, icd10cm, chunk_mapper, clinical, licensed, en]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the `snomed_icd10cm_mapper` model.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_icd10cm_mapping_en_3.5.3_3.0_1656363315439.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_icd10cm_mapping_en_3.5.3_3.0_1656363315439.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("snomed_icd10cm_mapping", "en", "clinical/models")
result = pipeline.fullAnnotate("128041000119107 292278006 293072005")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("snomed_icd10cm_mapping", "en", "clinical/models")
val result = pipeline.fullAnnotate("128041000119107 292278006 293072005")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.snomed_to_icd10cm.pipe").predict("""128041000119107 292278006 293072005""")
```
## Results
```bash
| | snomed_code | icd10cm_code |
|---:|:----------------------------------------|:---------------------------|
| 0 | 128041000119107 | 292278006 | 293072005 | K22.70 | T43.595 | T37.1X5 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|snomed_icd10cm_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.5 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- ChunkMapperModel
---
layout: model
title: Translate Lingala to English Pipeline
author: John Snow Labs
name: translate_ln_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, ln, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this module is very computationally expensive, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `ln`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ln_en_xx_2.7.0_2.4_1609698622095.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ln_en_xx_2.7.0_2.4_1609698622095.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_ln_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_ln_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.ln.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_ln_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Dutch DistilBERT Embeddings (from Geotrend)
author: John Snow Labs
name: distilbert_embeddings_distilbert_base_nl_cased
date: 2022-04-12
tags: [distilbert, embeddings, nl, open_source]
task: Embeddings
language: nl
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-nl-cased` is a Dutch model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_nl_cased_nl_3.4.2_3.0_1649783996172.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_nl_cased_nl_3.4.2_3.0_1649783996172.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_nl_cased","nl") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ik hou van vonk nlp"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_nl_cased","nl")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ik hou van vonk nlp").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("nl.embed.distilbert_base_cased").predict("""Ik hou van vonk nlp""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_embeddings_distilbert_base_nl_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|nl|
|Size:|229.3 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/distilbert-base-nl-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Legal Agreement and Plan of Reorganization Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_agreement_and_plan_of_reorganization_bert
date: 2022-12-06
tags: [en, legal, classification, agreement, plan, reorganization, bert, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_agreement_and_plan_of_reorganization_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `agreement-and-plan-of-reorganization` or not (Binary Classification).
Unlike the Longformer model, this model is lighter and offers faster inference.
## Predicted Entities
`agreement-and-plan-of-reorganization`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_agreement_and_plan_of_reorganization_bert_en_1.0.0_3.0_1670349241846.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_agreement_and_plan_of_reorganization_bert_en_1.0.0_3.0_1670349241846.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
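{% include programmingLanguageSelectScalaPythonNLU.html %}

The original card omits a usage snippet. The following is a minimal sketch based on sibling `legclf_*` cards, assuming the classifier consumes `sent_bert_base_cased` sentence embeddings (an assumption; the card only states that input is `sentence_embeddings`):

```python
# Hypothetical usage sketch modeled on other legclf_* model cards;
# the embeddings model name is an assumption, not confirmed by this card.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_agreement_and_plan_of_reorganization_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```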
## Results
```bash
+-------+
|result|
+-------+
|[agreement-and-plan-of-reorganization]|
|[other]|
|[other]|
|[agreement-and-plan-of-reorganization]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_agreement_and_plan_of_reorganization_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scraped from the Internet and classified in-house, plus SEC documents
## Benchmarking
```bash
label precision recall f1-score support
agreement-and-plan-of-reorganization 1.00 1.00 1.00 31
other 1.00 1.00 1.00 35
accuracy - - 1.00 66
macro-avg 1.00 1.00 1.00 66
weighted-avg 1.00 1.00 1.00 66
```
---
layout: model
title: Arabic BertForMaskedLM Base Cased model (from aubmindlab)
author: John Snow Labs
name: bert_embeddings_base_arabertv2
date: 2022-12-02
tags: [ar, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: ar
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-arabertv2` is an Arabic model originally trained by `aubmindlab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabertv2_ar_4.2.4_3.0_1670015875947.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabertv2_ar_4.2.4_3.0_1670015875947.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabertv2","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabertv2","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_arabertv2|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|507.6 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/aubmindlab/bert-base-arabertv2
- https://github.com/google-research/bert
- https://arxiv.org/abs/2003.00104
- https://github.com/WissamAntoun/pydata_khobar_meetup
- http://alt.qcri.org/farasa/segmenter.html
- https://github.com/google-research/bert/blob/master/multilingual.md
- https://github.com/elnagara/HARD-Arabic-Dataset
- https://www.aclweb.org/anthology/D15-1299
- https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf
- https://github.com/mohamedadaly/LABR
- http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp
- https://github.com/husseinmozannar/SOQAL
- https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md
- https://arxiv.org/abs/2003.00104v2
- https://archive.org/details/arwiki-20190201
- https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4
- https://www.aclweb.org/anthology/W19-4619
- https://sites.aub.edu.lb/mindlab/
- https://www.yakshof.com/#/
- https://www.behance.net/rahalhabib
- https://www.linkedin.com/in/wissam-antoun-622142b4/
- https://twitter.com/wissam_antoun
- https://github.com/WissamAntoun
- https://www.linkedin.com/in/fadybaly/
- https://twitter.com/fadybaly
- https://github.com/fadybaly
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_256_finetuned_squad_seed_6
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-256-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_6_en_4.3.0_3.0_1674214999722.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_6_en_4.3.0_3.0_1674214999722.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_6","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_6","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_256_finetuned_squad_seed_6|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|426.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-6
---
layout: model
title: English DistilBertForQuestionAnswering model (from huxxx657)
author: John Snow Labs
name: distilbert_qa_huxxx657_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `huxxx657`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_huxxx657_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725541025.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_huxxx657_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725541025.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_huxxx657_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_huxxx657_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_huxxx657").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_huxxx657_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/huxxx657/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Pipeline to Detect Genes and Human Phenotypes (biobert)
author: John Snow Labs
name: ner_human_phenotype_gene_biobert_pipeline
date: 2023-03-20
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_human_phenotype_gene_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_human_phenotype_gene_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_biobert_pipeline_en_4.3.0_3.2_1679315678860.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_biobert_pipeline_en_4.3.0_3.2_1679315678860.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_human_phenotype_gene_biobert_pipeline", "en", "clinical/models")
text = '''Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_human_phenotype_gene_biobert_pipeline", "en", "clinical/models")
val text = "Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3)."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.human_phenotype_gene_biobert.pipeline").predict("""Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:-----------------|--------:|------:|:------------|-------------:|
| 0 | type | 29 | 32 | GENE | 0.9977 |
| 1 | polyhydramnios | 75 | 88 | HP | 0.9949 |
| 2 | polyuria | 91 | 98 | HP | 0.9955 |
| 3 | nephrocalcinosis | 101 | 116 | HP | 0.995 |
| 4 | hypokalemia | 122 | 132 | HP | 0.9986 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_human_phenotype_gene_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.2 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Classify first part of agreements (Parties, Agreement type)
author: John Snow Labs
name: legclf_introduction_clause_cuad
date: 2022-11-25
tags: [en, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the first part of a legal agreement, where `PARTIES`, `AGREEMENT TYPE` and `ALIASES` or `ROLES` of the parties are described. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them, unless you want to do Binary Classification at sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques shown in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
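Paragraph splitting by multiline, the first technique above, amounts to breaking the text on blank lines before classification. A minimal plain-Python sketch (illustrative only; the workshop notebook linked below shows the full Spark NLP approach):

```python
import re

def split_paragraphs(text: str):
    """Split a document into paragraphs on blank lines (multiline splitting)."""
    # One or more blank lines (possibly containing whitespace) delimit paragraphs.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = "WITNESSETH:\n\nWHEREAS, the parties agree...\n\n\nSECTION 1. Definitions."
print(split_paragraphs(doc))
```

Each resulting chunk can then be sent to the classifier as a separate row, keeping every input under the 512-token limit mentioned below.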
Take into consideration that this model's embeddings allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`introduction`, `other`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_introduction_clause_cuad_en_1.0.0_3.0_1669371916085.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_introduction_clause_cuad_en_1.0.0_3.0_1669371916085.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
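No code snippet ships with this card. The sketch below follows the pattern of other `legclf_*` cards, which pair the classifier with sentence embeddings; the embeddings model (`sent_bert_base_cased`) and the `johnsnowlabs` import style are assumptions, not confirmed for this specific model:

```python
# Assumed usage sketch; verify the embeddings model against the training setup.
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Placeholder embeddings: this card's Input Labels require "sentence_embeddings".
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_introduction_clause_cuad", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```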
## Results
```bash
+--------------+
|        result|
+--------------+
|[introduction]|
|       [other]|
|       [other]|
|[introduction]|
+--------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_introduction_clause_cuad|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents scraped from the Internet and classified in-house.
## Benchmarking
```bash
label precision recall f1-score support
introduction 1.00 0.98 0.99 99
other 0.99 1.00 0.99 151
accuracy - - 0.99 250
macro-avg 0.99 0.99 0.99 250
weighted-avg 0.99 0.99 0.99 250
```
---
layout: model
title: Extract Entities Related to TNM Staging
author: John Snow Labs
name: ner_oncology_tnm
date: 2022-11-24
tags: [licensed, en, clinical, oncology, ner]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts staging information and mentions related to tumors, lymph nodes and metastases.
Definitions of Predicted Entities:
- `Cancer_Dx`: Mentions of cancer diagnoses (such as "breast cancer") or pathological types that are usually used as synonyms for "cancer" (e.g. "carcinoma"). When anatomical references are present, they are included in the Cancer_Dx extraction.
- `Lymph_Node`: Mentions of lymph nodes and pathological findings of the lymph nodes.
- `Lymph_Node_Modifier`: Words that refer to a lymph node being abnormal (such as "enlargement").
- `Metastasis`: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions.
- `Staging`: Mentions of cancer stage such as "stage 2b" or "T2N1M0". It also includes words such as "in situ", "early-stage" or "advanced".
- `Tumor`: All nonspecific terms that may be related to tumors, either malignant or benign (for example: "mass", "tumor", "lesion", or "neoplasm").
- `Tumor_Description`: Information related to tumor characteristics, such as size, presence of invasion, grade and histological type.
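For reference, the compact TNM strings mentioned under `Staging` (e.g. "T2N1M0") follow a fixed Tumor/Node/Metastasis pattern. A simplified regex illustrating that notation (not part of the model, which handles far more variants such as prefixed or suffixed codes):

```python
import re

# Simplified TNM pattern: T (primary tumor), N (regional nodes), M (metastasis).
TNM_RE = re.compile(r"T(?P<T>[0-4x])N(?P<N>[0-3x])M(?P<M>[01x])", re.IGNORECASE)

def parse_tnm(text):
    """Return the T/N/M components of the first TNM code found, or None."""
    m = TNM_RE.search(text)
    return m.groupdict() if m else None

print(parse_tnm("classified as T2N1M1 stage IV"))  # {'T': '2', 'N': '1', 'M': '1'}
```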
## Predicted Entities
`Cancer_Dx`, `Lymph_Node`, `Lymph_Node_Modifier`, `Metastasis`, `Staging`, `Tumor`, `Tumor_Description`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_tnm_en_4.2.2_3.0_1669308699155.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_tnm_en_4.2.2_3.0_1669308699155.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_tnm", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["The final diagnosis was metastatic breast carcinoma, and it was classified as T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_tnm", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("The final diagnosis was metastatic breast carcinoma, and it was classified as T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_tnm").predict("""The final diagnosis was metastatic breast carcinoma, and it was classified as T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2.""")
```
## Results
```bash
| chunk | ner_label |
|:-----------------|:------------------|
| metastatic | Metastasis |
| breast carcinoma | Cancer_Dx |
| T2N1M1 stage IV | Staging |
| 4 cm | Tumor_Description |
| tumor | Tumor |
| grade 2 | Tumor_Description |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_tnm|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|34.2 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Lymph_Node 570 77 77 647 0.88 0.88 0.88
Staging 232 22 26 258 0.91 0.90 0.91
Lymph_Node_Modifier 30 5 5 35 0.86 0.86 0.86
Tumor_Description 2651 581 490 3141 0.82 0.84 0.83
Tumor 1116 72 141 1257 0.94 0.89 0.91
Metastasis 358 15 12 370 0.96 0.97 0.96
Cancer_Dx 1302 87 92 1394 0.94 0.93 0.94
macro_avg 6259 859 843 7102 0.90 0.90 0.90
micro_avg 6259 859 843 7102 0.88 0.88 0.88
```
---
layout: model
title: English BertForQuestionAnswering model (from aodiniz)
author: John Snow Labs
name: bert_qa_bert_uncased_L_4_H_512_A_8_squad2_covid_qna
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-4_H-512_A-8_squad2_covid-qna` is an English model originally trained by `aodiniz`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_512_A_8_squad2_covid_qna_en_4.0.0_3.0_1654185314705.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_512_A_8_squad2_covid_qna_en_4.0.0_3.0_1654185314705.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_4_H_512_A_8_squad2_covid_qna","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_uncased_L_4_H_512_A_8_squad2_covid_qna","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_covid.bert.uncased_4l_512d_a8a_512d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_uncased_L_4_H_512_A_8_squad2_covid_qna|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|107.2 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/aodiniz/bert_uncased_L-4_H-512_A-8_squad2_covid-qna
---
layout: model
title: English asr_wav2vec2_base_demo_colab_by_thyagosme TFWav2Vec2ForCTC from thyagosme
author: John Snow Labs
name: asr_wav2vec2_base_demo_colab_by_thyagosme
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_demo_colab_by_thyagosme` is an English model originally trained by thyagosme.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_demo_colab_by_thyagosme_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_demo_colab_by_thyagosme_en_4.2.0_3.0_1664107996154.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_demo_colab_by_thyagosme_en_4.2.0_3.0_1664107996154.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_demo_colab_by_thyagosme", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_demo_colab_by_thyagosme", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_demo_colab_by_thyagosme|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|354.9 MB|
---
layout: model
title: Voice of the Patients (embeddings_clinical_medium)
author: John Snow Labs
name: ner_vop_emb_clinical_medium_wip
date: 2023-04-12
tags: [licensed, clinical, en, ner, vop, patient]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.0
spark_version: [3.0, 3.2]
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts healthcare-related terms from documents written in the patient's own words.
Note: The 'wip' suffix indicates that this model is a work in progress; it will be finalized and its performance improved in upcoming releases.
## Predicted Entities
`Allergen`, `SubstanceQuantity`, `RaceEthnicity`, `Measurements`, `InjuryOrPoisoning`, `Treatment`, `Modifier`, `TestResult`, `MedicalDevice`, `Vaccine`, `Frequency`, `HealthStatus`, `Route`, `RelationshipStatus`, `Procedure`, `Duration`, `DateTime`, `AdmissionDischarge`, `Disease`, `Test`, `Substance`, `Laterality`, `Symptom`, `ClinicalDept`, `Dosage`, `Age`, `Drug`, `VitalTest`, `PsychologicalCondition`, `Form`, `BodyPart`, `Employment`, `Gender`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/VOICE_OF_THE_PATIENTS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_emb_clinical_medium_wip_en_4.4.0_3.0_1681315530573.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_emb_clinical_medium_wip_en_4.4.0_3.0_1681315530573.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_vop_emb_clinical_medium_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_vop_emb_clinical_medium_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| chunk | ner_label |
|:---------------------|:-----------------------|
| 20 year old | Age |
| girl | Gender |
| hyperthyroid | Disease |
| 1 month ago | DateTime |
| weak | Symptom |
| light | Symptom |
| panic attacks | PsychologicalCondition |
| depression | PsychologicalCondition |
| left | Laterality |
| chest | BodyPart |
| pain | Symptom |
| increased | TestResult |
| heart rate | VitalTest |
| rapidly | Modifier |
| weight loss | Symptom |
| 4 months | Duration |
| hospital | ClinicalDept |
| discharged | AdmissionDischarge |
| hospital | ClinicalDept |
| blood tests | Test |
| brain | BodyPart |
| mri | Test |
| ultrasound scan | Test |
| endoscopy | Procedure |
| doctors | Employment |
| homeopathy doctor | Employment |
| he | Gender |
| hyperthyroid | Disease |
| TSH | Test |
| 0.15 | TestResult |
| T3 | Test |
| T4 | Test |
| normal | TestResult |
| b12 deficiency | Disease |
| vitamin D deficiency | Disease |
| weekly | Frequency |
| supplement | Drug |
| vitamin D | Drug |
| 1000 mcg | Dosage |
| b12 | Drug |
| daily | Frequency |
| homeopathy medicine | Treatment |
| 40 days | Duration |
| after 30 days | DateTime |
| TSH | Test |
| 0.5 | TestResult |
| now | DateTime |
| weakness | Symptom |
| depression | PsychologicalCondition |
| last week | DateTime |
| rapid | TestResult |
| heartrate | VitalTest |
| allopathy medicine | Treatment |
| homeopathy | Treatment |
| thyroid | BodyPart |
| allopathy | Treatment |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_emb_clinical_medium_wip|
|Compatibility:|Healthcare NLP 4.4.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.9 MB|
|Dependencies:|embeddings_clinical_medium|
## References
In-house annotated health-related text in colloquial language.
## Sample text from the training dataset
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Allergen 0 1 8 8 0.00 0.00 0.00
SubstanceQuantity 9 14 18 27 0.39 0.33 0.36
RaceEthnicity 2 0 6 8 1.00 0.25 0.40
Measurements 41 25 33 74 0.62 0.55 0.59
InjuryOrPoisoning 66 37 51 117 0.64 0.56 0.60
Treatment 96 39 46 142 0.71 0.68 0.69
Modifier 642 268 271 913 0.71 0.70 0.70
TestResult 394 185 154 548 0.68 0.72 0.70
MedicalDevice 177 76 67 244 0.70 0.73 0.71
Vaccine 20 4 12 32 0.83 0.63 0.71
Frequency 456 144 187 643 0.76 0.71 0.73
HealthStatus 60 4 38 98 0.94 0.61 0.74
Route 24 4 12 36 0.86 0.67 0.75
RelationshipStatus 19 3 9 28 0.86 0.68 0.76
Procedure 286 91 80 366 0.76 0.78 0.77
Duration 846 227 269 1115 0.79 0.76 0.77
DateTime 1813 455 391 2204 0.80 0.82 0.81
AdmissionDischarge 19 1 8 27 0.95 0.70 0.81
Disease 1247 318 256 1503 0.80 0.83 0.81
Test 734 150 175 909 0.83 0.81 0.82
Substance 156 48 22 178 0.76 0.88 0.82
Laterality 440 91 78 518 0.83 0.85 0.84
Symptom 3069 566 630 3699 0.84 0.83 0.84
ClinicalDept 205 35 31 236 0.85 0.87 0.86
Dosage 273 42 49 322 0.87 0.85 0.86
Age 294 60 29 323 0.83 0.91 0.87
Drug 1035 188 100 1135 0.85 0.91 0.88
VitalTest 144 23 13 157 0.86 0.92 0.89
PsychologicalCondition 284 32 30 314 0.90 0.90 0.90
Form 234 32 17 251 0.88 0.93 0.91
BodyPart 2532 256 213 2745 0.91 0.92 0.92
Employment 980 65 62 1042 0.94 0.94 0.94
Gender 1174 27 20 1194 0.98 0.98 0.98
macro_avg 17771 3511 3385 21156 0.79 0.73 0.75
micro_avg 17771 3511 3385 21156 0.84 0.84 0.84
```
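The micro-averaged row in the table above can be reproduced directly from the aggregate tp/fp/fn counts; a quick sanity check in plain Python:

```python
# Recompute the micro-averaged metrics from the aggregate counts
# reported in the benchmarking table above.
tp, fp, fn = 17771, 3511, 3385

precision = tp / (tp + fp)  # true positives over all predicted positives
recall = tp / (tp + fn)     # true positives over all gold positives
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))
```

The micro average pools counts across all labels before computing the metrics, whereas the macro average first computes each label's metrics and then averages them, which is why the two rows differ.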
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from amitjohn007)
author: John Snow Labs
name: roberta_qa_amitjohn007_base_finetuned_squad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad` is an English model originally trained by `amitjohn007`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_amitjohn007_base_finetuned_squad_en_4.3.0_3.0_1674217120272.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_amitjohn007_base_finetuned_squad_en_4.3.0_3.0_1674217120272.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_amitjohn007_base_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_amitjohn007_base_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_amitjohn007_base_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|464.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/amitjohn007/roberta-base-finetuned-squad
---
layout: model
title: Named Entity Recognition (NER) Model in Danish (Dane 840B 300)
author: John Snow Labs
name: dane_ner_840B_300
date: 2020-08-30
task: Named Entity Recognition
language: da
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [ner, da, open_source]
supported: true
annotator: NerDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
DaNE is a Named Entity Recognition (NER) model for Danish, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. Dane NER 840B 300 is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline.
{:.h2_title}
## Predicted Entities
Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_DA/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/dane_ner_840B_300_da_2.6.0_2.4_1598810268070.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/dane_ner_840B_300_da_2.6.0_2.4_1598810268070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang = "xx") \
.setInputCols(['document', 'token']) \
.setOutputCol('embeddings')
ner_model = NerDLModel.pretrained("dane_ner_840B_300", "da") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text'))
result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, softwareudvikler, investor og filantrop. Han er bedst kendt som medstifter af Microsoft Corporation. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administrerende direktør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individuelle aktionær indtil maj 2014. Han er en af \u200b\u200bde mest kendte iværksættere og pionerer inden for mikrocomputerrevolution i 1970'erne og 1980'erne. Født og opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; det fortsatte med at blive verdens største virksomhed inden for personlig computersoftware. Gates førte virksomheden som formand og administrerende direktør, indtil han trådte tilbage som administrerende direktør i januar 2000, men han forblev formand og blev chefsoftwarearkitekt. I slutningen af \u200b\u200b1990'erne var Gates blevet kritiseret for sin forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. I juni 2006 meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. Han overførte gradvist sine pligter til Ray Ozzie og Craig Mundie. Han trådte tilbage som formand for Microsoft i februar 2014 og tiltrådte en ny stilling som teknologirådgiver for at støtte den nyudnævnte administrerende direktør Satya Nadella."]], ["text"]))
```
```scala
...
val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang = "xx")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("dane_ner_840B_300", "da")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, softwareudvikler, investor og filantrop. Han er bedst kendt som medstifter af Microsoft Corporation. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administrerende direktør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individuelle aktionær indtil maj 2014. Han er en af de mest kendte iværksættere og pionerer inden for mikrocomputerrevolution i 1970"erne og 1980"erne. Født og opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; det fortsatte med at blive verdens største virksomhed inden for personlig computersoftware. Gates førte virksomheden som formand og administrerende direktør, indtil han trådte tilbage som administrerende direktør i januar 2000, men han forblev formand og blev chefsoftwarearkitekt. I slutningen af 1990'erne var Gates blevet kritiseret for sin forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. I juni 2006 meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. Han overførte gradvist sine pligter til Ray Ozzie og Craig Mundie. Han trådte tilbage som formand for Microsoft i februar 2014 og tiltrådte en ny stilling som teknologirådgiver for at støtte den nyudnævnte administrerende direktør Satya Nadella.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, softwareudvikler, investor og filantrop. Han er bedst kendt som medstifter af Microsoft Corporation. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administrerende direktør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individuelle aktionær indtil maj 2014. Han er en af de mest kendte iværksættere og pionerer inden for mikrocomputerrevolution i 1970'erne og 1980'erne. Født og opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; det fortsatte med at blive verdens største virksomhed inden for personlig computersoftware. Gates førte virksomheden som formand og administrerende direktør, indtil han trådte tilbage som administrerende direktør i januar 2000, men han forblev formand og blev chefsoftwarearkitekt. I slutningen af 1990'erne var Gates blevet kritiseret for sin forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. I juni 2006 meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. Han overførte gradvist sine pligter til Ray Ozzie og Craig Mundie. Han trådte tilbage som formand for Microsoft i februar 2014 og tiltrådte en ny stilling som teknologirådgiver for at støtte den nyudnævnte administrerende direktør Satya Nadella."""]
ner_df = nlu.load('da.ner.840B300D').predict(text, output_level = "chunk")
ner_df[["entities", "entities_confidence"]]
```
{:.h2_title}
## Results
```bash
+-------------------------------+---------+
|chunk |ner_label|
+-------------------------------+---------+
|William Henry Gates |PER |
|amerikansk |MISC |
|Microsoft Corporation |ORG |
|Microsoft |ORG |
|Gates |PER |
|1970'erne |MISC |
|1980'erne |MISC |
|Seattle |LOC |
|Washington |LOC |
|Gates |PER |
|Microsoft |ORG |
|Paul Allen |PER |
|Albuquerque |LOC |
|New Mexico |LOC |
|1990'erne |MISC |
|Gates |MISC |
|Gates |PER |
|Microsoft |ORG |
|Bill & Melinda Gates Foundation|PER |
|Melinda Gates |PER |
+-------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|dane_ner_840B_300|
|Type:|ner|
|Compatibility:| Spark NLP 2.6.0+|
|Edition:|Official|
|License:|Open Source|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|da|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The detailed information can be found from [https://www.aclweb.org/anthology/2020.lrec-1.565.pdf](https://www.aclweb.org/anthology/2020.lrec-1.565.pdf)
---
layout: model
title: Legal Parties in interest Clause Binary Classifier
author: John Snow Labs
name: legclf_parties_in_interest_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `parties-in-interest` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (see the tutorial linked above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, yielding a series of True/False values for each of the clause classifiers you have added.
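Outside Spark, the aggregation of several binary clause classifiers can be sketched in plain Python (purely illustrative; the second model name in the example dict is hypothetical):

```python
# Illustrative aggregation: each binary clause classifier emits its
# clause label or "other"; turn that into one True/False flag per
# clause type.
def combine_clause_predictions(predictions):
    """predictions: dict mapping classifier name -> predicted category."""
    return {name: label != "other" for name, label in predictions.items()}

preds = {
    "legclf_parties_in_interest_clause": "parties-in-interest",
    "legclf_confidentiality_clause": "other",  # hypothetical sibling model
}
flags = combine_clause_predictions(preds)
```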
## Predicted Entities
`other`, `parties-in-interest`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_parties_in_interest_clause_en_1.0.0_3.2_1660122826311.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_parties_in_interest_clause_en_1.0.0_3.2_1660122826311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
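This card does not include a usage snippet; below is a minimal sketch, assuming the standard Legal NLP clause-classification layout used by sibling `legclf_*` models (a sentence-embedding stage feeding a `ClassifierDLModel`; the embedding model name here is an assumption):

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumption: a generic sentence-embedding model producing the
# sentence_embeddings this classifier expects.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = ClassifierDLModel.pretrained("legclf_parties_in_interest_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```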
## Results
```bash
+---------------------+
|result               |
+---------------------+
|[parties-in-interest]|
|[other]              |
|[other]              |
|[parties-in-interest]|
+---------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_parties_in_interest_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scrapped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.99 0.99 0.99 106
parties-in-interest 0.98 0.98 0.98 49
accuracy - - 0.99 155
macro-avg 0.99 0.99 0.99 155
weighted-avg 0.99 0.99 0.99 155
```
---
layout: model
title: Longformer Base NER Pipeline
author: John Snow Labs
name: longformer_base_token_classifier_conll03_pipeline
date: 2022-06-19
tags: [ner, longformer, pipeline, conll, token_classification, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [longformer_base_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/10/09/longformer_base_token_classifier_conll03_en.html) model.
## Predicted Entities
`PER`, `LOC`, `ORG`, `MISC`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655653913352.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655653913352.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("longformer_base_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I am working at John Snow Labs.")
```
```scala
val pipeline = new PretrainedPipeline("longformer_base_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I am working at John Snow Labs.")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|John |PER |
|John Snow Labs|ORG |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|longformer_base_token_classifier_conll03_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|516.0 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- LongformerForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: English image_classifier_vit_test ViTForImageClassification from flyswot
author: John Snow Labs
name: image_classifier_vit_test
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_test` is an English model originally trained by flyswot.
## Predicted Entities
`EDGE + SPINE`, `OTHER`, `PAGE + FOLIO`, `FLYSHEET`, `CONTAINER`, `CONTROL SHOT`, `COVER`, `SCROLL`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_test_en_4.1.0_3.0_1660169623877.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_test_en_4.1.0_3.0_1660169623877.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_test", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_test", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_test|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.4 MB|
---
layout: model
title: Extract treatment entities (Voice of the Patients)
author: John Snow Labs
name: ner_vop_treatment_wip
date: 2023-04-20
tags: [licensed, clinical, en, ner, vop, patient, treatment]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts treatment-related entities from health-related text written in patients' own words.
Note: The 'wip' suffix indicates that model development is work in progress; the model will be finalized and its performance improved in upcoming releases.
## Predicted Entities
`Treatment`, `Frequency`, `Procedure`, `Route`, `Duration`, `Dosage`, `Drug`, `Form`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_treatment_wip_en_4.4.0_3.0_1682013186202.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_treatment_wip_en_4.4.0_3.0_1682013186202.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_vop_treatment_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["My grandpa was diagnosed with type 2 diabetes and had to make some changes to his lifestyle. He also takes metformin and glipizide to help regulate his blood sugar levels. It's been a bit of an adjustment, but he's doing well."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_vop_treatment_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("My grandpa was diagnosed with type 2 diabetes and had to make some changes to his lifestyle. He also takes metformin and glipizide to help regulate his blood sugar levels. It's been a bit of an adjustment, but he's doing well.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| chunk | ner_label |
|:----------|:------------|
| metformin | Drug |
| glipizide | Drug |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_treatment_wip|
|Compatibility:|Healthcare NLP 4.4.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.9 MB|
|Dependencies:|embeddings_clinical|
## References
In-house annotated health-related text in colloquial language.
## Sample text from the training dataset
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Treatment 144 44 67 211 0.77 0.68 0.72
Frequency 801 111 271 1072 0.88 0.75 0.81
Procedure 505 104 112 617 0.83 0.82 0.82
Route 29 3 8 37 0.91 0.78 0.84
Duration 1926 382 345 2271 0.83 0.85 0.84
Dosage 350 56 49 399 0.86 0.88 0.87
Drug 1210 125 108 1318 0.91 0.92 0.91
Form 235 23 18 253 0.91 0.93 0.92
macro_avg 5200 848 978 6178 0.86 0.83 0.84
micro_avg 5200 848 978 6178 0.86 0.84 0.85
```
---
layout: model
title: Adverse Drug Events Classifier (LogReg)
author: John Snow Labs
name: classifier_logreg_ade
date: 2023-05-11
tags: [text_classification, ade, clinical, licensed, logreg, en]
task: Text Classification
language: en
edition: Healthcare NLP 4.4.1
spark_version: 3.0
supported: true
annotator: DocumentLogRegClassifierModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained with the Logistic Regression algorithm and classifies a text/sentence into one of two categories:
- `True`: the sentence contains information about a possible ADE.
- `False`: the sentence does not contain any information about an ADE.
The corpus used for model training is the ADE-Corpus-V2 dataset (Adverse Drug Reaction Data), a dataset for classifying whether a sentence is ADE-related (`True`) or not (`False`).
## Predicted Entities
`True`, `False`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifier_logreg_ade_en_4.4.1_3.0_1683817451286.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifier_logreg_ade_en_4.4.1_3.0_1683817451286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
logreg = DocumentLogRegClassifierModel.pretrained("classifier_logreg_ade", "en", "clinical/models")\
.setInputCols("token")\
.setOutputCol("prediction")
clf_Pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
logreg])
data = spark.createDataFrame([["""None of the patients required treatment for the overdose."""], ["""Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient."""]]).toDF("text")
result = clf_Pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val logreg = DocumentLogRegClassifierModel.pretrained("classifier_logreg_ade", "en", "clinical/models")
.setInputCols("token")
.setOutputCol("prediction")
val clf_Pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, logreg))
val data = Seq("None of the patients required treatment for the overdose.", "Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.").toDS.toDF("text")
val result = clf_Pipeline.fit(data).transform(data)
```
## Results
```bash
+----------------------------------------------------------------------------------------+-------+
|text |result |
+----------------------------------------------------------------------------------------+-------+
|Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.|[True] |
|None of the patients required treatment for the overdose. |[False]|
+----------------------------------------------------------------------------------------+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|classifier_logreg_ade|
|Compatibility:|Healthcare NLP 4.4.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[prediction]|
|Language:|en|
|Size:|595.7 KB|
## References
The corpus used for model training is ADE-Corpus-V2 Dataset: Adverse Drug Reaction Data. This is a dataset for classification of a sentence if it is ADE-related (True) or not (False).
Reference: Gurulingappa et al., Benchmark Corpus to Support Information Extraction for Adverse Drug Effects, JBI, 2012. http://www.sciencedirect.com/science/article/pii/S1532046412000615
## Benchmarking
```bash
label precision recall f1-score support
False 0.91 0.92 0.92 3362
True 0.79 0.79 0.79 1361
accuracy - - 0.88 4723
macro_avg 0.85 0.85 0.85 4723
weighted_avg 0.88 0.88 0.88 4723
```
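The macro and weighted averages in the ADE table follow from the per-label rows and their supports; for example, for precision:

```python
# Reproduce the macro and support-weighted precision averages from
# the per-label rows of the ADE benchmark above.
rows = {
    "False": {"precision": 0.91, "support": 3362},
    "True":  {"precision": 0.79, "support": 1361},
}
total_support = sum(r["support"] for r in rows.values())

# Macro: unweighted mean over labels; weighted: mean weighted by support.
macro_precision = sum(r["precision"] for r in rows.values()) / len(rows)
weighted_precision = (
    sum(r["precision"] * r["support"] for r in rows.values()) / total_support
)
```

Because the `False` class has more than twice the support of `True`, the weighted average sits closer to the `False` row's score than the macro average does.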
---
layout: model
title: English asr_english_filipino_wav2vec2_l_xls_r_test_09 TFWav2Vec2ForCTC from Khalsuu
author: John Snow Labs
name: pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_09
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_english_filipino_wav2vec2_l_xls_r_test_09` is an English model originally trained by Khalsuu.
NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use `pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_09_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_09_en_4.2.0_3.0_1664119422310.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_09_en_4.2.0_3.0_1664119422310.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_09', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_09", lang = "en")
val annotations = pipeline.transform(audioDF)
```
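The `audioDF` passed to `transform` is expected to hold raw audio as arrays of floats (typically 16 kHz mono). As an illustration of that preprocessing step — the function name here is hypothetical, not part of the pipeline API — a stdlib sketch converting little-endian 16-bit PCM bytes into normalized floats:

```python
import struct

def pcm16_to_floats(pcm_bytes: bytes) -> list[float]:
    """Convert little-endian 16-bit PCM samples to floats in [-1.0, 1.0)."""
    n = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % n, pcm_bytes[: n * 2])
    return [s / 32768.0 for s in samples]

# Two samples: maximum positive and maximum negative amplitude.
raw = struct.pack("<2h", 32767, -32768)
print(pcm16_to_floats(raw))  # [0.999969482421875, -1.0]
```

A column of such float arrays, named `audio_content`, is what the pipeline's AudioAssembler stage consumes.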
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_09|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: German BertForMaskedLM Large Cased model (from deepset)
author: John Snow Labs
name: bert_embeddings_g_large
date: 2022-12-02
tags: [de, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: de
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gbert-large` is a German model originally trained by `deepset`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_g_large_de_4.2.4_3.0_1670022204873.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_g_large_de_4.2.4_3.0_1670022204873.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_g_large","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_g_large","de")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
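The resulting `embeddings` column holds one dense vector per token; a common downstream step is comparing those vectors by cosine similarity. A self-contained sketch with toy 3-dimensional vectors (real gbert-large vectors are much larger):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.2, -0.1, 0.4]
v2 = [0.2, -0.1, 0.4]
v3 = [-0.4, 0.1, -0.2]
print(cosine_similarity(v1, v2))  # ~1.0 for identical vectors
print(cosine_similarity(v1, v3))  # negative for roughly opposite vectors
```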
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_g_large|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|de|
|Size:|1.3 GB|
|Case sensitive:|true|
## References
- https://huggingface.co/deepset/gbert-large
- https://arxiv.org/pdf/2010.10906.pdf
- https://arxiv.org/pdf/2010.10906.pdf
- http://deepset.ai/
- https://haystack.deepset.ai/
- https://deepset.ai/german-bert
- https://deepset.ai/germanquad
- https://github.com/deepset-ai/haystack
- https://docs.haystack.deepset.ai
- https://haystack.deepset.ai/community
- https://twitter.com/deepset_ai
- https://www.linkedin.com/company/deepset-ai/
- https://haystack.deepset.ai/community
- https://github.com/deepset-ai/haystack/discussions
- https://deepset.ai
- http://www.deepset.ai/jobs
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_4_h_512
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-4_H-512` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_512_zh_4.2.4_3.0_1670021676308.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_512_zh_4.2.4_3.0_1670021676308.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_512","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_512","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_4_h_512|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|90.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-4_H-512
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://github.com/dbiir/UER-py/
- https://cloud.tencent.com/
---
layout: model
title: English BertForQuestionAnswering model (from Vasanth)
author: John Snow Labs
name: bert_qa_bert_base_uncased_qa_squad2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-qa-squad2` is an English model originally trained by `Vasanth`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_qa_squad2_en_4.0.0_3.0_1654181282352.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_qa_squad2_en_4.0.0_3.0_1654181282352.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_qa_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_uncased_qa_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.bert.base_uncased.by_Vasanth").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
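Internally, extractive QA models of this kind score every token as a candidate answer start and end, then return the best-scoring span. A simplified, self-contained sketch of that span-selection step, using toy logits rather than real model outputs:

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e], with s <= e."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within a maximum span length.
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.0, 0.2, 3.0, 0.0, 0.1, 0.0, 0.0, 0.5, 0.0]
end_logits   = [0.0, 0.1, 0.0, 2.5, 0.0, 0.0, 0.1, 0.0, 0.4, 0.2]
s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s : e + 1]))  # Clara
```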
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_uncased_qa_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Vasanth/bert-base-uncased-qa-squad2
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk TFWav2Vec2ForCTC from krirk
author: John Snow Labs
name: asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk` is an English model originally trained by krirk.
NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk_en_4.2.0_3.0_1664042673180.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk_en_4.2.0_3.0_1664042673180.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
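Wav2Vec2ForCTC produces per-frame character predictions that are collapsed into text by CTC decoding. A minimal greedy-decoding sketch (toy vocabulary and frames; the real model's vocabulary and blank id may differ):

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then drop CTC blanks."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

# 0 is the blank; repeated non-blank ids collapse to a single character,
# and a blank between two identical ids keeps them as separate characters.
vocab = {1: "h", 2: "e", 3: "l", 4: "o"}
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]
print(ctc_greedy_decode(frames, vocab))  # hello
```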
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: English AlbertForQuestionAnswering model (from rowan1224)
author: John Snow Labs
name: albert_qa_slp
date: 2022-06-24
tags: [en, open_source, albert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: AlBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert-slp` is an English model originally trained by `rowan1224`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_slp_en_4.0.0_3.0_1656063737900.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_slp_en_4.0.0_3.0_1656063737900.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_slp","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_slp","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.albert.by_rowan1224").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_qa_slp|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|42.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/rowan1224/albert-slp
---
layout: model
title: Fast Neural Machine Translation Model from English to Italic Languages
author: John Snow Labs
name: opus_mt_en_itc
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, itc, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `itc`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_itc_xx_2.7.0_2.4_1609170676219.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_itc_xx_2.7.0_2.4_1609170676219.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_itc", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Text to translate goes here.")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_itc", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Text to translate goes here.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.itc').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_itc|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from saraks)
author: John Snow Labs
name: distilbert_qa_cuad_parties_08_25
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cuad-distil-parties-08-25` is an English model originally trained by `saraks`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_parties_08_25_en_4.3.0_3.0_1672766229642.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_parties_08_25_en_4.3.0_3.0_1672766229642.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_parties_08_25","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_parties_08_25","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_cuad_parties_08_25|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/saraks/cuad-distil-parties-08-25
---
layout: model
title: Multilingual DistilBertForQuestionAnswering model (from ZYW) Squad
author: John Snow Labs
name: distilbert_qa_squad_en_de_es_vi_zh_model
date: 2022-06-08
tags: [en, de, vi, zh, es, open_source, distilbert, question_answering, xx]
task: Question Answering
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad-en-de-es-vi-zh-model` is a multilingual model originally trained by `ZYW`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_en_de_es_vi_zh_model_xx_4.0.0_3.0_1654728800543.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_en_de_es_vi_zh_model_xx_4.0.0_3.0_1654728800543.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_en_de_es_vi_zh_model","xx") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_en_de_es_vi_zh_model","xx")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("xx.answer_question.squad.distil_bert._en_de_es_vi_zh_tuned.by_ZYW").predict("""PUT YOUR QUESTION HERE|||"PUT YOUR CONTEXT HERE""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_squad_en_de_es_vi_zh_model|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|xx|
|Size:|505.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ZYW/squad-en-de-es-vi-zh-model
---
layout: model
title: Fast and Accurate Language Identification - 220 Languages (CNN)
author: John Snow Labs
name: ld_wiki_tatoeba_cnn_220
date: 2020-12-05
task: Language Detection
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [language_detection, open_source, xx]
supported: true
annotator: LanguageDetectorDL
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Language detection and identification is the task of automatically detecting the language(s) present in a document based on its content. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language in documents with mixed languages by coalescing sentences and selecting the best candidate.
We have designed and developed Deep Learning models using CNNs in TensorFlow/Keras. The model is trained on large datasets such as Wikipedia and Tatoeba with high accuracy evaluated on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias).
This model can detect the following languages:
`Achinese`, `Afrikaans`, `Tosk Albanian`, `Amharic`, `Aragonese`, `Old English`, `Arabic`, `Egyptian Arabic`, `Assamese`, `Asturian`, `Avaric`, `Aymara`, `Azerbaijani`, `South Azerbaijani`, `Bashkir`, `Bavarian`, `bat-smg`, `Central Bikol`, `Belarusian`, `Bulgarian`, `bh`, `Bengali`, `Tibetan`, `Bishnupriya`, `Breton`, `Russia Buriat`, `Catalan`, `Min Dong Chinese`, `Chechen`, `Cebuano`, `Central Kurdish (Soranî)`, `Corsican`, `Crimean Tatar`, `Czech`, `Kashubian`, `Chuvash`, `Welsh`, `Danish`, `German`, `Dimli (individual language)`, `Lower Sorbian`, `Dhivehi`, `Greek`, `eml`, `English`, `Esperanto`, `Spanish`, `Estonian`, `Basque`, `Extremaduran`, `Persian`, `Finnish`, `fiu-vro`, `Faroese`, `French`, `Arpitan`, `Friulian`, `Frisian`, `Irish`, `Gagauz`, `Scottish Gaelic`, `Galician`, `Guarani`, `Konkani (Goan)`, `Gujarati`, `Manx`, `Hausa`, `Hakka Chinese`, `Hebrew`, `Hindi`, `Fiji Hindi`, `Upper Sorbian`, `Haitian Creole`, `Hungarian`, `Armenian`, `Interlingua`, `Indonesian`, `Interlingue`, `Igbo`, `Ilocano`, `Ido`, `Icelandic`, `Italian`, `Japanese`, `Jamaican Patois`, `Lojban`, `Javanese`, `Georgian`, `Karakalpak`, `Kabyle`, `Kabardian`, `Kazakh`, `Khmer`, `Kannada`, `Korean`, `Komi-Permyak`, `Karachay-Balkar`, `Kölsch`, `Kurdish`, `Komi`, `Cornish`, `Kyrgyz`, `Latin`, `Ladino`, `Luxembourgish`, `Lezghian`, `Luganda`, `Limburgan`, `Ligurian`, `Lombard`, `Lingala`, `Lao`, `Northern Luri`, `Lithuanian`, `Latvian`, `Maithili`, `map-bms`, `Malagasy`, `Meadow Mari`, `Maori`, `Minangkabau`, `Macedonian`, `Malayalam`, `Mongolian`, `Marathi`, `Hill Mari`, `Maltese`, `Mirandese`, `Burmese`, `Erzya`, `Mazanderani`, `Nahuatl`, `Neapolitan`, `Low German (Low Saxon)`, `nds-nl`, `Nepali`, `Newari`, `Dutch`, `Norwegian Nynorsk`, `Norwegian`, `Narom`, `Pedi`, `Navajo`, `Occitan`, `Livvi`, `Oromo`, `Odia (Oriya)`, `Ossetian`, `Punjabi (Eastern)`, `Pangasinan`, `Kapampangan`, `Papiamento`, `Picard`, `Palatine German`, `Polish`, `Punjabi (Western)`, `Pashto`, `Portuguese`, 
`Quechua`, `Romansh`, `Romanian`, `roa-tara`, `Russian`, `Rusyn`, `Kinyarwanda`, `Sanskrit`, `Yakut`, `Sardinian`, `Sicilian`, `Scots`, `Sindhi`, `Northern Sami`, `Sinhala`, `Slovak`, `Slovenian`, `Shona`, `Somali`, `Albanian`, `Serbian`, `Saterland Frisian`, `Sundanese`, `Swedish`, `Swahili`, `Silesian`, `Tamil`, `Tulu`, `Telugu`, `Tetun`, `Tajik`, `Thai`, `Turkmen`, `Tagalog`, `Setswana`, `Tongan`, `Turkish`, `Tatar`, `Tuvinian`, `Udmurt`, `Uyghur`, `Ukrainian`, `Urdu`, `Uzbek`, `Venetian`, `Veps`, `Vietnamese`, `Vlaams`, `Volapük`, `Walloon`, `Waray`, `Wolof`, `Shanghainese`, `Xhosa`, `Mingrelian`, `Yiddish`, `Yoruba`, `Zeeuws`, `Chinese`, `zh-classical`, `zh-min-nan`, `zh-yue`.
## Predicted Entities
`ace`, `af`, `als`, `am`, `an`, `ang`, `ar`, `arz`, `as`, `ast`, `av`, `ay`, `az`, `azb`, `ba`, `bar`, `bat-smg`, `bcl`, `be`, `bg`, `bh`, `bn`, `bo`, `bpy`, `br`, `bxr`, `ca`, `cdo`, `ce`, `ceb`, `ckb`, `co`, `crh`, `cs`, `csb`, `cv`, `cy`, `da`, `de`, `diq`, `dsb`, `dv`, `el`, `eml`, `en`, `eo`, `es`, `et`, `eu`, `ext`, `fa`, `fi`, `fiu-vro`, `fo`, `fr`, `frp`, `fur`, `fy`, `ga`, `gag`, `gd`, `gl`, `gn`, `gom`, `gu`, `gv`, `ha`, `hak`, `he`, `hi`, `hif`, `hsb`, `ht`, `hu`, `hy`, `ia`, `id`, `ie`, `ig`, `ilo`, `io`, `is`, `it`, `ja`, `jam`, `jbo`, `jv`, `ka`, `kaa`, `kab`, `kbd`, `kk`, `km`, `kn`, `ko`, `koi`, `krc`, `ksh`, `ku`, `kv`, `kw`, `ky`, `la`, `lad`, `lb`, `lez`, `lg`, `li`, `lij`, `lmo`, `ln`, `lo`, `lrc`, `lt`, `lv`, `mai`, `map-bms`, `mg`, `mhr`, `mi`, `min`, `mk`, `ml`, `mn`, `mr`, `mrj`, `mt`, `mwl`, `my`, `myv`, `mzn`, `nah`, `nap`, `nds`, `nds-nl`, `ne`, `new`, `nl`, `nn`, `no`, `nrm`, `nso`, `nv`, `oc`, `olo`, `om`, `or`, `os`, `pa`, `pag`, `pam`, `pap`, `pcd`, `pfl`, `pl`, `pnb`, `ps`, `pt`, `qu`, `rm`, `ro`, `roa-tara`, `ru`, `rue`, `rw`, `sa`, `sah`, `sc`, `scn`, `sco`, `sd`, `se`, `si`, `sk`, `sl`, `sn`, `so`, `sq`, `sr`, `stq`, `su`, `sv`, `sw`, `szl`, `ta`, `tcy`, `te`, `tet`, `tg`, `th`, `tk`, `tl`, `tn`, `to`, `tr`, `tt`, `tyv`, `udm`, `ug`, `uk`, `ur`, `uz`, `vec`, `vep`, `vi`, `vls`, `vo`, `wa`, `war`, `wo`, `wuu`, `xh`, `xmf`, `yi`, `yo`, `zea`, `zh`, `zh-classical`, `zh-min-nan`, `zh-yue`.
{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ld_wiki_tatoeba_cnn_220_xx_2.7.0_2.4_1607184539094.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ld_wiki_tatoeba_cnn_220_xx_2.7.0_2.4_1607184539094.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
language_detector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_220", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("language")
languagePipeline = Pipeline(stages=[documentAssembler, sentenceDetector, language_detector])
light_pipeline = LightPipeline(languagePipeline.fit(spark.createDataFrame([['']]).toDF("text")))
result = light_pipeline.fullAnnotate("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.")
```
```scala
...
val languageDetector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_220", "xx")
.setInputCols("sentence")
.setOutputCol("language")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, languageDetector))
val data = Seq("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."]
lang_df = nlu.load('xx.classify.wiki_220').predict(text, output_level='sentence')
lang_df
```
## Results
```bash
'fr'
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ld_wiki_tatoeba_cnn_220|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[language]|
|Language:|xx|
## Data Source
Wikipedia and Tatoeba
## Benchmarking
```bash
Evaluated on the Europarl dataset, which the model has never seen:
+--------+-----+-------+------------------+
|src_lang|count|correct| precision|
+--------+-----+-------+------------------+
| sv| 1000| 999| 0.999|
| fr| 1000| 999| 0.999|
| fi| 1000| 998| 0.998|
| it| 1000| 997| 0.997|
| pt| 1000| 995| 0.995|
| el| 1000| 994| 0.994|
| de| 1000| 993| 0.993|
| en| 1000| 990| 0.99|
| nl| 1000| 987| 0.987|
| hu| 880| 866|0.9840909090909091|
| da| 1000| 980| 0.98|
| es| 1000| 976| 0.976|
| ro| 784| 765|0.9757653061224489|
| et| 928| 905|0.9752155172413793|
| lt| 1000| 975| 0.975|
| cs| 1000| 973| 0.973|
| pl| 914| 889|0.9726477024070022|
| sk| 1000| 941| 0.941|
| bg| 1000| 939| 0.939|
| lv| 916| 857|0.9355895196506551|
| sl| 914| 789|0.8632385120350109|
+--------+-----+-------+------------------+
+-------+-------------------+
|summary| precision|
+-------+-------------------+
| count| 21|
| mean| 0.9734546412641623|
| stddev|0.03176749551086062|
| min| 0.8632385120350109|
| max| 0.999|
+-------+-------------------+
```
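The summary row can be reproduced directly from the per-language precisions above; this is a quick sanity check in plain Python (not part of the Spark NLP pipeline), noting that Spark's `stddev` is the sample standard deviation:

```python
from statistics import mean, stdev

# Per-language precisions copied from the benchmark table above.
precisions = [
    0.999, 0.999, 0.998, 0.997, 0.995, 0.994, 0.993, 0.99, 0.987,
    0.9840909090909091, 0.98, 0.976, 0.9757653061224489,
    0.9752155172413793, 0.975, 0.973, 0.9726477024070022,
    0.941, 0.939, 0.9355895196506551, 0.8632385120350109,
]

print(len(precisions))              # 21 languages
print(round(mean(precisions), 6))   # 0.973455
print(round(stdev(precisions), 6))  # ~0.031767 (sample stddev, as Spark reports)
```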
---
layout: model
title: Hausa Named Entity Recognition (from mbeukman)
author: John Snow Labs
name: xlmroberta_ner_xlm_roberta_base_finetuned_ner_hausa
date: 2022-05-17
tags: [xlm_roberta, ner, token_classification, ha, open_source]
task: Named Entity Recognition
language: ha
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-ner-hausa` is a Hausa model originally trained by `mbeukman`.
## Predicted Entities
`PER`, `ORG`, `LOC`, `DATE`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_ner_hausa_ha_3.4.2_3.0_1652808366927.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_ner_hausa_ha_3.4.2_3.0_1652808366927.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_ner_hausa","ha") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Ina son Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_ner_hausa","ha")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Ina son Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_ner_hausa|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|ha|
|Size:|775.3 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-ner-hausa
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://www.apache.org/licenses/LICENSE-2.0
---
layout: model
title: English BertForQuestionAnswering Small Cased model (from motiondew)
author: John Snow Labs
name: bert_qa_sd3_small
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-sd3-small` is an English model originally trained by `motiondew`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sd3_small_en_4.0.0_3.0_1657188265535.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sd3_small_en_4.0.0_3.0_1657188265535.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd3_small","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd3_small","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_sd3_small|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/motiondew/bert-sd3-small
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from drewski)
author: John Snow Labs
name: distilbert_qa_drewski_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `drewski`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_drewski_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770549448.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_drewski_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770549448.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_drewski_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_drewski_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_drewski_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/drewski/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English Bert Embeddings (from kornosk)
author: John Snow Labs
name: bert_embeddings_bert_political_election2020_twitter_mlm
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-political-election2020-twitter-mlm` is an English model originally trained by `kornosk`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_political_election2020_twitter_mlm_en_3.4.2_3.0_1649672268288.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_political_election2020_twitter_mlm_en_3.4.2_3.0_1649672268288.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_political_election2020_twitter_mlm","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_political_election2020_twitter_mlm","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.bert_political_election2020_twitter_mlm").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_political_election2020_twitter_mlm|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|410.5 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/kornosk/bert-political-election2020-twitter-mlm
- https://www.aclweb.org/anthology/2021.naacl-main.376
- https://github.com/GU-DataLab/stance-detection-KE-MLM
---
layout: model
title: Typed Dependency Parsing pipeline for English
author: John Snow Labs
name: dependency_parse
date: 2021-03-27
tags: [pipeline, dependency_parsing, untyped_dependency_parsing, typed_dependency_parsing, labelled_dependency_parsing, unlabelled_dependency_parsing, en, open_source]
supported: true
task: [Dependency Parser, Pipeline Public]
language: en
nav_key: models
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: Pipeline
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Typed Dependency parser, trained on the CoNLL dataset.
Dependency parsing is the task of extracting a dependency parse of a sentence that represents its grammatical structure and defines the relationships between “head” words and the words that modify those heads.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/dependency_parse_en_3.0.0_3.0_1616864258046.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/dependency_parse_en_3.0.0_3.0_1616864258046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('dependency_parse', lang = 'en')
annotations = pipeline.fullAnnotate("Dependencies represents relationships betweens words in a Sentence")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("dependency_parse", lang = "en")
val result = pipeline.fullAnnotate("Dependencies represents relationships betweens words in a Sentence")(0)
```
{:.nlu-block}
```python
nlu.load("dep.typed").predict("Dependencies represents relationships betweens words in a Sentence")
```
## Results
```bash
+---------------------------------------------------------------------------------+--------------------------------------------------------+
|result |result |
+---------------------------------------------------------------------------------+--------------------------------------------------------+
|[ROOT, Dependencies, represents, words, relationships, Sentence, Sentence, words]|[root, parataxis, nsubj, amod, nsubj, case, nsubj, flat]|
+---------------------------------------------------------------------------------+--------------------------------------------------------+
```
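The two `result` columns above are aligned position by position: the first array holds each token's syntactic head, the second the dependency label. A minimal sketch of reading them side by side (the arrays are copied from the output above; the token list is assumed from the input sentence):

```python
# One (token, label, head) triple per token, recovered by zipping the arrays.
tokens = ["Dependencies", "represents", "relationships", "betweens", "words", "in", "a", "Sentence"]
heads  = ["ROOT", "Dependencies", "represents", "words", "relationships", "Sentence", "Sentence", "words"]
labels = ["root", "parataxis", "nsubj", "amod", "nsubj", "case", "nsubj", "flat"]

for tok, lab, head in zip(tokens, labels, heads):
    print(f"{tok} --{lab}--> {head}")
```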
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|dependency_parse|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
## Included Models
- DocumentAssembler
- SentenceDetector
- Tokenizer
- PerceptronModel
- DependencyParserModel
- TypedDependencyParserModel
---
layout: model
title: Word2Vec Embeddings in Quechua (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, qu, open_source]
task: Embeddings
language: qu
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_qu_3.4.1_3.0_1647453594212.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_qu_3.4.1_3.0_1647453594212.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","qu") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","qu")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("qu.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|qu|
|Size:|110.0 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Lemmatizer (Czech, SpacyLookup)
author: John Snow Labs
name: lemma_spacylookup
date: 2022-03-03
tags: [open_source, lemmatizer, cs]
task: Lemmatization
language: cs
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Czech Lemmatizer is a scalable, production-ready version of the rule-based lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_cs_3.4.1_3.0_1646316557035.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_cs_3.4.1_3.0_1646316557035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","cs") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer])
example = spark.createDataFrame([["Nejste lepší než já"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","cs")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer))
val data = Seq("Nejste lepší než já").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("cs.lemma.spacylookup").predict("""Nejste lepší než já""")
```
## Results
```bash
+-------------------------+
|result |
+-------------------------+
|[Nejste, lepšit, než, já]|
+-------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma_spacylookup|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|cs|
|Size:|379.0 KB|
---
layout: model
title: Clinical Deidentification (glove)
author: John Snow Labs
name: clinical_deidentification_glove
date: 2023-06-13
tags: [deidentification, en, licensed, pipeline]
task: Pipeline Healthcare
language: en
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline is trained with lightweight `glove_100d` embeddings and can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR`, and `EMAIL` entities.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_en_4.4.4_3.2_1686663547185.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_en_4.4.4_3.2_1686663547185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("clinical_deidentification_glove", "en", "clinical/models")
sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13.
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93.
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""
result = deid_pipeline.annotate(sample)
print("\n".join(result['masked']))
print("\n".join(result['masked_with_chars']))
print("\n".join(result['masked_fixed_length_chars']))
print("\n".join(result['obfuscated']))
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val deid_pipeline = new PretrainedPipeline("clinical_deidentification_glove","en","clinical/models")
val result = deid_pipeline.annotate("Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.deid.glove_pipeline").predict("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13.
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93.
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""")
```
## Results
```bash
Results
Masked with entity labels
------------------------------
Name : <PATIENT>, Record date: <DATE>, # <MEDICALRECORD>.
Dr. <DOCTOR>, ID: <IDNUM>, IP <IPADDR>.
He is a <AGE> male was admitted to the <HOSPITAL> for cystectomy on <DATE>.
Patient's VIN : <VIN>, SSN <SSN>, Driver's license <DLN>.
Phone <PHONE>, <STREET>, <CITY>, E-MAIL: <EMAIL>.
Masked with chars
------------------------------
Name : [**************], Record date: [********], # [****].
Dr. [********], ID: [********], IP [************].
He is a [*********] male was admitted to the [**********] for cystectomy on [******].
Patient's VIN : [***************], SSN [**********], Driver's license [*********].
Phone [************], [***************], [***********], E-MAIL: [*************].
Masked with fixed length chars
------------------------------
Name : ****, Record date: ****, # ****.
Dr. ****, ID: ****, IP ****.
He is a **** male was admitted to the **** for cystectomy on ****.
Patient's VIN : ****, SSN ****, Driver's license ****.
Phone ****, ****, ****, E-MAIL: ****.
Obfuscated
------------------------------
Name : Berneta Anis, Record date: 2093-02-19, # U4660137.
Dr. Dr Worley Colonel, ID: ZJ:9570208, IP 005.005.005.005.
He is a 67 male was admitted to the ST. LUKE'S HOSPITAL AT THE VINTAGE for cystectomy on 06-02-1981.
Patient's VIN : 3CCCC22DDDD333888, SSN SSN-618-77-1042, Driver's license W693817528998.
Phone 0496 46 46 70, 3100 weston rd, Shattuck, E-MAIL: Freddi@hotmail.com.
```
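The masking modes shown above differ only in what replaces each detected PHI span. A minimal plain-Python sketch of the idea (the entity span and label here are hard-coded for illustration; a real pipeline gets them from the NER stages):

```python
def mask(text, start, end, label, mode="label"):
    """Replace text[start:end] according to a masking mode."""
    span = text[start:end]
    if mode == "label":            # masked: replace with the entity label
        repl = f"<{label}>"
    elif mode == "chars":          # masked_with_chars: same length as the span
        repl = "[" + "*" * (len(span) - 2) + "]"
    elif mode == "fixed":          # masked_fixed_length_chars: always four chars
        repl = "****"
    else:
        raise ValueError(f"unknown mode: {mode}")
    return text[:start] + repl + text[end:]

print(mask("Dr. John Green admitted the patient.", 4, 14, "DOCTOR", mode="chars"))
# -> Dr. [********] admitted the patient.
```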
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|clinical_deidentification_glove|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|181.4 MB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- ChunkMergeModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ChunkMergeModel
- ChunkMergeModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- Finisher
---
layout: model
title: Financial English BERT Embeddings (Number shape masking)
author: John Snow Labs
name: bert_embeddings_sec_bert_sh
date: 2022-04-12
tags: [bert, embeddings, en, open_source, financial]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Financial BERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `sec-bert-shape` is an English model originally trained by `nlpaueb`. This model is the same as BERT Base, but numbers are replaced with pseudo-tokens that represent the number's shape, so numeric expressions (of known shapes) are no longer fragmented; e.g., '53.2' becomes '[XX.X]' and '40,200.5' becomes '[XX,XXX.X]'.
If you are interested in Financial Embeddings, take a look also at these two models:
- [sec-base](https://nlp.johnsnowlabs.com/2022/04/12/bert_embeddings_sec_bert_base_en_3_0.html): Same as BERT Base but trained with financial documents.
- [sec-num](https://nlp.johnsnowlabs.com/2022/04/12/bert_embeddings_sec_bert_num_en_3_0.html): Same as BERT sec-base, but every number token is replaced with a [NUM] pseudo-token, handling all numeric expressions in a uniform manner and disallowing their fragmentation.
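The shape-masking transformation described above can be sketched in a few lines of plain Python. This is a rough illustration of the idea, not the model's actual tokenizer preprocessing:

```python
import re

def number_shape(token: str) -> str:
    """Replace each digit with 'X', keeping separators: '40,200.5' -> '[XX,XXX.X]'."""
    return "[" + re.sub(r"\d", "X", token) + "]"

def mask_number_shapes(text: str) -> str:
    # Replace every numeric expression (digits with optional ',' or '.'
    # separators) by its shape pseudo-token.
    return re.sub(r"\d[\d.,]*\d|\d", lambda m: number_shape(m.group()), text)

print(mask_number_shapes("Revenue grew from 40,200.5 to 53.2 million"))
# -> Revenue grew from [XX,XXX.X] to [XX.X] million
```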
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_sec_bert_sh_en_3.4.2_3.0_1649758845734.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_sec_bert_sh_en_3.4.2_3.0_1649758845734.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_sh","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_sh","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed.sec_bert_sh").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_sec_bert_sh|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|409.5 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/nlpaueb/sec-bert-shape
- https://arxiv.org/abs/2203.06482
- http://nlp.cs.aueb.gr/
---
layout: model
title: Stopwords Remover for Serbian language (389 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, sr, open_source]
task: Stop Words Removal
language: sr
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_sr_3.4.1_3.0_1646673003296.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_sr_3.4.1_3.0_1646673003296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","sr") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Ниси бољи од мене"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","sr")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("Ниси бољи од мене").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("sr.stopwords").predict("""Ниси бољи од мене""")
```
## Results
```bash
+------+
|result|
+------+
|[бољи]|
+------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|sr|
|Size:|2.7 KB|
---
layout: model
title: Sentence Entity Resolver for LOINC (sbiobert_base_cased_mli embeddings)
author: John Snow Labs
name: sbiobertresolve_loinc_augmented
date: 2021-11-23
tags: [loinc, entity_resolution, clinical, en, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.2
spark_version: 2.4
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted clinical NER entities to LOINC codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It is trained on an augmented version of the dataset used in the previous LOINC resolver models.
## Predicted Entities
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_loinc_augmented_en_3.3.2_2.4_1637664939262.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_loinc_augmented_en_3.3.2_2.4_1637664939262.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The ```sbiobertresolve_loinc_augmented``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_jsl``` as the NER model. Set ```Test, BMI, HDL, LDL, Medical_Device, Temperature, Total_Cholesterol, Triglycerides, Blood_Pressure``` in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols("document")\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical','en', 'clinical/models')\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(['Test'])
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_loinc_augmented","en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("loinc_code")\
.setDistanceFunction("EUCLIDEAN")
pipeline_loinc = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner, ner_converter, chunk2doc, sbert_embedder, resolver])
data = spark.createDataFrame([["""The patient is a 22-year-old female with a history of obesity. She has a Body mass index (BMI) of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. Her hgba1c is 8.2%."""]]).toDF("text")
results = pipeline_loinc.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("Test"))
val chunk2doc = Chunk2Doc()
.setInputCols("ner_chunk")
.setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli", "en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
val resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_loinc_augmented", "en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("loinc_code")
.setDistanceFunction("EUCLIDEAN")
val pipeline_loinc = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner, ner_converter, chunk2doc, sbert_embedder, resolver))
val data = Seq("The patient is a 22-year-old female with a history of obesity. She has a Body mass index (BMI) of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. Her hgba1c is 8.2%.").toDF("text")
val result = pipeline_loinc.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.loinc.augmented").predict("""The patient is a 22-year-old female with a history of obesity. She has a Body mass index (BMI) of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. Her hgba1c is 8.2%.""")
```
## Results
```bash
+--------------------------+-----+---+------+----------+----------+--------------------------------------------------+--------------------------------------------------+
| chunk|begin|end|entity|confidence|Loinc_Code| all_codes| resolutions|
+--------------------------+-----+---+------+----------+----------+--------------------------------------------------+--------------------------------------------------+
| Body mass index| 74| 88| Test|0.39306664| LP35925-4|LP35925-4:::BDYCRC:::LP172732-2:::39156-5:::LP7...|body mass index:::body circumference:::body mus...|
|aspartate aminotransferase| 111|136| Test| 0.74925| LP15426-7|LP15426-7:::14409-7:::LP307348-5:::LP15333-5:::...|aspartate aminotransferase::: aspartate transam...|
| alanine aminotransferase| 146|169| Test| 0.9579| LP15333-5|LP15333-5:::LP307326-1:::16324-6:::LP307348-5::...|alanine aminotransferase:::alanine aminotransfe...|
| hgba1c| 180|185| Test| 0.1118| 17855-8|17855-8:::4547-6:::55139-0:::72518-4:::45190-6:...| hba1c::: hgb a1::: hb1::: hcds1::: hhc1::: htr...|
+--------------------------+-----+---+------+----------+----------+--------------------------------------------------+--------------------------------------------------+
```
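The `all_codes` and `resolutions` columns above pack each candidate list into a single ':::'-separated string. A small helper (an illustrative sketch, not part of the resolver's API) can unpack them into aligned (code, term) pairs:

```python
def parse_candidates(all_codes: str, resolutions: str, sep: str = ":::"):
    """Split the ':::'-joined candidate codes and resolution terms
    returned by the resolver into aligned (code, term) pairs."""
    codes = [c.strip() for c in all_codes.split(sep)]
    terms = [t.strip() for t in resolutions.split(sep)]
    return list(zip(codes, terms))

# Generic example (hypothetical codes/terms, for illustration only):
pairs = parse_candidates("A1:::B2", "term a:::term b")
# -> [("A1", "term a"), ("B2", "term b")]
```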
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_loinc_augmented|
|Compatibility:|Healthcare NLP 3.3.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[loinc_code]|
|Language:|en|
|Case sensitive:|false|
## Data Source
Trained on standard LOINC coding system.
---
layout: model
title: Generic Deidentification NER (SEC Bert Embeddings)
author: John Snow Labs
name: finner_deid_sec
date: 2023-02-24
tags: [deid, deidentification, anonymization, en, licensed]
task: Named Entity Recognition
language: en
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: FinanceNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a NER model which allows you to detect some generic entities that may need to be masked or obfuscated to comply with different regulations, such as GDPR and CCPA. This is just an NER model; make sure you try the full de-identification pipelines available in Models Hub.
The only difference between this and `finner_deid` is the embeddings used.
## Predicted Entities
`AGE`, `CITY`, `COUNTRY`, `DATE`, `EMAIL`, `FAX`, `LOCATION-OTHER`, `ORG`, `PERSON`, `PHONE`, `PROFESSION`, `STATE`, `STREET`, `URL`, `ZIP`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/DEID_FIN/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_deid_sec_en_1.0.0_3.0_1677282571388.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_deid_sec_en_1.0.0_3.0_1677282571388.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained('finner_deid_sec', "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""
This LICENSE AND DEVELOPMENT AGREEMENT (this Agreement) is entered into effective as of Nov. 02, 2019 (the Effective Date) by and between Bioeq IP AG, having its principal place of business at 333 Twin Dolphin Drive, Suite 600, Redwood City, CA, 94065, USA (Licensee).
"""]
res = model.transform(spark.createDataFrame([text]).toDF("text"))
```
## Results
```bash
+-----------+----------------+
| token| ner_label|
+-----------+----------------+
| This| O|
| LICENSE| O|
| AND| O|
|DEVELOPMENT| O|
| AGREEMENT| O|
| (| O|
| this| O|
| Agreement| O|
| )| O|
| is| O|
| entered| O|
| into| O|
| effective| O|
| as| O|
| of| O|
| Nov| B-DATE|
| .| I-DATE|
| 02| I-DATE|
| ,| I-DATE|
| 2019| I-DATE|
| (| O|
| the| O|
| Effective| O|
| Date| O|
| )| O|
| by| O|
| and| O|
| between| O|
| Bioeq| O|
| IP| O|
| AG| O|
| ,| O|
| having| O|
| its| O|
| principal| O|
| place| O|
| of| O|
| business| O|
| at| O|
| 333| B-STREET|
| Twin| I-STREET|
| Dolphin| I-STREET|
| Drive| I-STREET|
| ,| O|
| Suite|B-LOCATION-OTHER|
| 600|I-LOCATION-OTHER|
| ,| O|
| Redwood| B-CITY|
| City| I-CITY|
| ,| O|
| CA| B-STATE|
| ,| O|
| 94065| B-ZIP|
| ,| O|
| USA| B-STATE|
| (| O|
| Licensee| O|
| ).| O|
+-----------+----------------+
```
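In the pipeline itself, `NerConverter` turns these token-level BIO labels back into entity chunks; a standalone helper doing the same merge (an illustrative sketch, not Spark NLP code) looks like this:

```python
def bio_to_chunks(tokens, labels):
    """Merge token-level BIO labels into (chunk_text, entity) pairs.
    Tokens are joined with spaces; an I- tag whose entity does not
    match the open chunk simply closes that chunk (a simplification)."""
    chunks, cur_toks, cur_ent = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if cur_ent is not None:
                chunks.append((" ".join(cur_toks), cur_ent))
            cur_toks, cur_ent = [tok], lab[2:]
        elif lab.startswith("I-") and cur_ent == lab[2:]:
            cur_toks.append(tok)
        else:  # "O" or a mismatched I- tag closes any open chunk
            if cur_ent is not None:
                chunks.append((" ".join(cur_toks), cur_ent))
            cur_toks, cur_ent = [], None
    if cur_ent is not None:
        chunks.append((" ".join(cur_toks), cur_ent))
    return chunks

# Using a slice of the output above:
tokens = ["at", "333", "Twin", "Dolphin", "Drive", ","]
labels = ["O", "B-STREET", "I-STREET", "I-STREET", "I-STREET", "O"]
# bio_to_chunks(tokens, labels) -> [("333 Twin Dolphin Drive", "STREET")]
```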
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finner_deid_sec|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|16.4 MB|
## References
In-house annotated documents with protected information
## Benchmarking
```bash
label precision recall f1-score support
B-AGE 0.96 0.89 0.92 245
B-CITY 0.85 0.86 0.86 123
B-COUNTRY 0.86 0.67 0.75 36
B-DATE 0.98 0.97 0.97 2352
B-ORG 0.75 0.71 0.73 38
B-PERSON 0.97 0.94 0.95 1348
B-PHONE 0.86 0.80 0.83 86
B-PROFESSION 0.93 0.75 0.83 84
B-STATE 0.92 0.89 0.91 102
B-STREET 0.99 0.91 0.95 89
I-CITY 0.82 0.77 0.79 35
I-COUNTRY 1.00 0.50 0.67 6
I-DATE 0.96 0.95 0.96 402
I-ORG 0.71 0.86 0.77 28
I-PERSON 0.98 0.96 0.97 1240
I-PHONE 0.91 0.92 0.92 77
I-PROFESSION 0.96 0.79 0.87 70
I-STATE 1.00 0.62 0.77 8
I-STREET 0.98 0.94 0.96 188
I-ZIP 0.84 0.97 0.90 60
O 1.00 1.00 1.00 194103
accuracy - - 1.00 200762
macro-avg 0.72 0.62 0.65 200762
weighted-avg 1.00 1.00 1.00 200762
```
---
layout: model
title: Fast Neural Machine Translation Model from Italic Languages to English
author: John Snow Labs
name: opus_mt_itc_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, itc, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `itc`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_itc_en_xx_2.7.0_2.4_1609170571711.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_itc_en_xx_2.7.0_2.4_1609170571711.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_itc_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Ciao, come stai?")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_itc_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Ciao, come stai?").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.itc.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_itc_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Navteca's Tapas Table Understanding (Large, WTQ)
author: John Snow Labs
name: table_qa_tapas_large_finetuned_wtq
date: 2022-09-30
tags: [en, table, qa, question, answering, open_source]
task: Table Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: TapasForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a Zero-shot Table Understanding model which allows you to carry out Question Answering on Spark DataFrames. If your table is stored in a file format such as CSV, load and convert it before passing it to the pipeline.
Size of this model: Large
Has aggregation operations?: True
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_large_finetuned_wtq_en_4.2.0_3.0_1664530763103.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_large_finetuned_wtq_en_4.2.0_3.0_1664530763103.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
json_data = """
{
"header": ["name", "money", "age"],
"rows": [
["Donald Trump", "$100,000,000", "75"],
["Elon Musk", "$20,000,000,000,000", "55"]
]
}
"""
queries = [
"Who earns less than 200,000,000?",
"Who earns 100,000,000?",
"How much money has Donald Trump?",
"How old are they?",
]
data = spark.createDataFrame([
[json_data, " ".join(queries)]
]).toDF("table_json", "questions")
document_assembler = MultiDocumentAssembler() \
.setInputCols("table_json", "questions") \
.setOutputCols("document_table", "document_questions")
sentence_detector = SentenceDetector() \
.setInputCols(["document_questions"]) \
.setOutputCol("questions")
table_assembler = TableAssembler()\
.setInputCols(["document_table"])\
.setOutputCol("table")
tapas = TapasForQuestionAnswering\
.pretrained("table_qa_tapas_large_finetuned_wtq","en")\
.setInputCols(["questions", "table"])\
.setOutputCol("answers")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
table_assembler,
tapas
])
model = pipeline.fit(data)
model\
.transform(data)\
.selectExpr("explode(answers) AS answer")\
.select("answer")\
.show(truncate=False)
```
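The snippet above builds the table inline as JSON. If your table lives in a CSV file instead, a small helper (an illustrative sketch, not part of the model card) can produce the same `{"header": [...], "rows": [...]}` structure:

```python
import csv
import io
import json

def csv_to_table_json(csv_text: str) -> str:
    """Convert CSV text into the {"header": [...], "rows": [[...], ...]}
    JSON structure consumed as table_json in the snippet above."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return json.dumps({"header": rows[0], "rows": rows[1:]})

csv_text = 'name,money,age\nDonald Trump,"$100,000,000",75\nElon Musk,"$20,000,000,000,000",55\n'
json_data = csv_to_table_json(csv_text)
```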
## Results
```bash
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|answer |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} |
|{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} |
|{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} |
|{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|table_qa_tapas_large_finetuned_wtq|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|false|
## References
https://www.microsoft.com/en-us/download/details.aspx?id=54253
https://github.com/ppasupat/WikiTableQuestions
---
layout: model
title: Arabic BertForMaskedLM Mini Cased model (from asafaya)
author: John Snow Labs
name: bert_embeddings_mini_arabic
date: 2022-12-02
tags: [ar, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: ar
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-mini-arabic` is an Arabic model originally trained by `asafaya`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_mini_arabic_ar_4.2.4_3.0_1670020664579.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_mini_arabic_ar_4.2.4_3.0_1670020664579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_mini_arabic","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_mini_arabic","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_mini_arabic|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|43.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/asafaya/bert-mini-arabic
- https://traces1.inria.fr/oscar/
- http://commoncrawl.org/
- https://dumps.wikimedia.org/backup-index.html
- https://github.com/google-research/bert
- https://www.tensorflow.org/tfrc
- https://github.com/alisafaya/Arabic-BERT
---
layout: model
title: Legal United Nations Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_united_nations_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, united_nations, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
The legclf_united_nations_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class United_Nations or not (binary classification) according to EuroVoc labels.
## Predicted Entities
`United_Nations`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_united_nations_bert_en_1.0.0_3.0_1678111655117.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_united_nations_bert_en_1.0.0_3.0_1678111655117.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
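This card ships without a usage snippet; the following is a minimal sketch in the style of the other Legal NLP classifier cards. The `sent_bert_base_cased` embeddings model and the `category` output column name are assumptions, not confirmed by this card (its Input Labels are `[sentence_embeddings]` and Output Labels `[class]`).

```python
# Minimal sketch; assumptions are noted in the lead-in above.
from johnsnowlabs import nlp, legal
from pyspark.ml import Pipeline

spark = nlp.start()

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Assumption: a generic sentence-BERT model producing sentence_embeddings
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_united_nations_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```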
## Results
```bash
+----------------+
|          result|
+----------------+
|[United_Nations]|
|         [Other]|
|         [Other]|
|[United_Nations]|
+----------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_united_nations_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.5 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 0.85 0.85 0.85 20
United_Nations 0.87 0.87 0.87 23
accuracy - - 0.86 43
macro-avg 0.86 0.86 0.86 43
weighted-avg 0.86 0.86 0.86 43
```
---
layout: model
title: English BertForQuestionAnswering model (from datauma)
author: John Snow Labs
name: bert_qa_datauma_bert_finetuned_squad
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `datauma`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_datauma_bert_finetuned_squad_en_4.0.0_3.0_1654535608413.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_datauma_bert_finetuned_squad_en_4.0.0_3.0_1654535608413.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_datauma_bert_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_datauma_bert_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.by_datauma").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_datauma_bert_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/datauma/bert-finetuned-squad
---
layout: model
title: English BertForQuestionAnswering model (from peggyhuang)
author: John Snow Labs
name: bert_qa_bert_base_uncased_coqa
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-coqa` is an English model originally trained by `peggyhuang`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_coqa_en_4.0.0_3.0_1654180732707.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_coqa_en_4.0.0_3.0_1654180732707.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_coqa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_uncased_coqa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bert.base_uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_uncased_coqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/peggyhuang/bert-base-uncased-coqa
---
layout: model
title: Universal Sentence Encoder Multilingual Large (tfhub_use_multi_lg)
author: John Snow Labs
name: tfhub_use_multi_lg
date: 2021-05-06
tags: [xx, open_source, embeddings]
task: Embeddings
language: xx
edition: Spark NLP 3.0.0
spark_version: 3.0
deprecated: true
annotator: UniversalSentenceEncoder
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks.
The model is trained and optimized for greater-than-word length text, such as sentences, phrases, or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is the variable-length text and the output is a 512-dimensional vector. The universal-sentence-encoder model has trained with a deep averaging network (DAN) encoder.
This model is a text encoder supporting 16 languages: Arabic, Chinese (simplified), Chinese (traditional), English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, and Russian.
The details are described in the paper "[Multilingual Universal Sentence Encoder for Semantic Retrieval](https://arxiv.org/abs/1907.04307)".
Note: This model only works on Linux and macOS operating systems and is not compatible with Windows due to the incompatibility of the SentencePiece library.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/tfhub_use_multi_lg_xx_3.0.0_3.0_1620294638956.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/tfhub_use_multi_lg_xx_3.0.0_3.0_1620294638956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
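The usage snippet is missing from this page; a minimal pipeline sketch in the style of the sibling model cards (assuming a live Spark session `spark` with Spark NLP started; the `UniversalSentenceEncoder` annotator consumes `DOCUMENT`-type input) would be:

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# "xx" is the multi-language code used for multilingual models.
embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_multi_lg", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

pipeline = Pipeline(stages=[documentAssembler, embeddings])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```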
## Results
```bash
It gives a 512-dimensional vector for each sentence.
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|tfhub_use_multi_lg|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|xx|
## Data Source
This embeddings model is imported from [https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3](https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3)
## Benchmarking
```bash
- We apply this model to the STS benchmark for semantic similarity. Results are shown below:
STSBenchmark | dev | test |
-----------------------------------|--------|-------|
Correlation coefficient of Pearson | 0.837 | 0.825 |
- For semantic similarity retrieval, we evaluate the model on the [Quora and AskUbuntu retrieval tasks](https://arxiv.org/abs/1811.08008). Results are shown below:
Dataset | Quora | AskUbuntu | Average |
-----------------------|-------|-----------|---------|
Mean Average Precision | 89.1 | 42.3 | 65.7 |
- For translation pair retrieval, we evaluate the model on the United Nations Parallel Corpus. Results are shown below:
Language Pair | en-es | en-fr | en-ru | en-zh |
---------------|--------|-------|-------|-------|
Precision@1 | 86.1 | 83.3 | 88.9 | 78.8 |
```
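The STS scores above are Pearson correlations between human similarity judgments and the cosine similarity of the sentence vectors. A minimal sketch of the similarity computation itself (plain NumPy on toy vectors, not actual model outputs):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 512-dimensional vectors standing in for USE sentence embeddings.
rng = np.random.default_rng(0)
u = rng.normal(size=512)
v = u + 0.1 * rng.normal(size=512)  # a slightly perturbed copy

# Similar sentences map to nearby vectors, so this is typically close to 1.0.
print(cosine_similarity(u, v))
```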
---
layout: model
title: Portuguese asr_bp_sid10_xlsr TFWav2Vec2ForCTC from lgris
author: John Snow Labs
name: asr_bp_sid10_xlsr
date: 2022-09-26
tags: [wav2vec2, pt, audio, open_source, asr]
task: Automatic Speech Recognition
language: pt
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_bp_sid10_xlsr` is a Portuguese model originally trained by lgris.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_bp_sid10_xlsr_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_bp_sid10_xlsr_pt_4.2.0_3.0_1664191635179.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_bp_sid10_xlsr_pt_4.2.0_3.0_1664191635179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_bp_sid10_xlsr", "pt")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_bp_sid10_xlsr", "pt")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_bp_sid10_xlsr|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|pt|
|Size:|756.4 MB|
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from janeel)
author: John Snow Labs
name: roberta_qa_janeel_base_finetuned_squad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad` is an English model originally trained by `janeel`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_janeel_base_finetuned_squad_en_4.3.0_3.0_1674217296605.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_janeel_base_finetuned_squad_en_4.3.0_3.0_1674217296605.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_janeel_base_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_janeel_base_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_janeel_base_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/janeel/roberta-base-finetuned-squad
---
layout: model
title: Pipeline to Detect Clinical Entities
author: John Snow Labs
name: ner_jsl_greedy_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_jsl_greedy](https://nlp.johnsnowlabs.com/2021/06/24/ner_jsl_greedy_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_pipeline_en_3.4.1_3.0_1647869775586.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_pipeline_en_3.4.1_3.0_1647869775586.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_jsl_greedy_pipeline", "en", "clinical/models")
pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.")
```
```scala
val pipeline = new PretrainedPipeline("ner_jsl_greedy_pipeline", "en", "clinical/models")
pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.jsl_greedy.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
## Results
```bash
+----------------------------------------------+----------------------------+
|chunk |ner_label |
+----------------------------------------------+----------------------------+
|21-day-old |Age |
|Caucasian |Race_Ethnicity |
|male |Gender |
|for 2 days |Duration |
|congestion |Symptom |
|mom |Gender |
|suctioning yellow discharge |Symptom |
|nares |External_body_part_or_region|
|she |Gender |
|mild problems with his breathing while feeding|Symptom |
|perioral cyanosis |Symptom |
|retractions |Symptom |
|One day ago |RelativeDate |
|mom |Gender |
|tactile temperature |Symptom |
|Tylenol |Drug |
|Baby |Age |
|decreased p.o. intake |Symptom |
|His |Gender |
|20 minutes |Duration |
+----------------------------------------------+----------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_jsl_greedy_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: Detect Problems, Tests and Treatments (ner_crf)
author: John Snow Labs
name: ner_crf
class: NerCrfModel
language: en
nav_key: models
repository: clinical/models
date: 2020-01-28
task: Named Entity Recognition
edition: Healthcare NLP 2.4.0
spark_version: 2.4
tags: [ner]
supported: true
annotator: NerCrfModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
The named entity recognition annotator allows a generic model to be trained with a CRF (Conditional Random Field) algorithm.
Clinical NER (Large) is a Named Entity Recognition model that annotates text to find references to clinical events. The entities it annotates are Problem, Treatment, and Test. Clinical NER is trained with the `embeddings_clinical` word embeddings model, so be sure to use the same embeddings in the pipeline.
## Predicted Entities
`Problem`, `Test`, `Treatment`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_crf_en_2.4.0_2.4_1580237286004.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_crf_en_2.4.0_2.4_1580237286004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPython.html %}
```python
model = NerCrfModel.pretrained("ner_crf","en","clinical/models")\
.setInputCols("sentence","token","pos","word_embeddings")\
.setOutputCol("ner")
```
```scala
val model = NerCrfModel.pretrained("ner_crf","en","clinical/models")
.setInputCols("sentence","token","pos","word_embeddings")
.setOutputCol("ner")
```
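The snippets above load only the model stage. Since the annotator expects `sentence`, `token`, `pos`, and `word_embeddings` input columns, a fuller pipeline sketch would look like the following (the `pos_clinical` POS model named here is an assumption; substitute whichever clinical POS model your installation provides):

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, PerceptronModel, WordEmbeddingsModel, NerCrfModel
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Part-of-speech tags are a required input for the CRF model.
pos = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("word_embeddings")

ner = NerCrfModel.pretrained("ner_crf", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "pos", "word_embeddings"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, pos, embeddings, ner])
```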
{:.model-param}
## Model Information
{:.table-model}
|---------------|---------------------------------------|
| Name: | ner_crf |
| Type: | NerCrfModel |
| Compatibility: | Spark NLP 2.4.0+ |
| License: | Licensed |
| Edition: | Official |
|Input labels: | [sentence, token, pos, word_embeddings] |
|Output labels: | [ner] |
| Language: | en |
| Dependencies: | embeddings_clinical |
{:.h2_title}
## Data Source
Trained on augmented i2b2 data with the `embeddings_clinical` word embeddings model.
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_hier_triplet_0.1_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223509314.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223509314.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|460.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AnonymousSub/rule_based_roberta_hier_triplet_0.1_epochs_1_shard_1_squad2.0
---
layout: model
title: Extract Clinical Department Entities from Voice of the Patient Documents (embeddings_clinical)
author: John Snow Labs
name: ner_vop_clinical_dept
date: 2023-06-06
tags: [licensed, clinical, en, ner, vop]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts admission/discharge, clinical department, and medical device mentions from health-related documents written in the patient's own words.
## Predicted Entities
`AdmissionDischarge`, `ClinicalDept`, `MedicalDevice`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_en_4.4.3_3.0_1686074506621.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_en_4.4.3_3.0_1686074506621.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_vop_clinical_dept", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. Wishing him a speedy recovery!"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_vop_clinical_dept", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. Wishing him a speedy recovery!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| chunk | ner_label |
|:----------------------|:--------------|
| orthopedic department | ClinicalDept |
| titanium plate | MedicalDevice |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_clinical_dept|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.8 MB|
|Dependencies:|embeddings_clinical|
## References
In-house annotated health-related text in colloquial language.
## Sample text from the training dataset
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
## Benchmarking
```bash
label tp fp fn total precision recall f1
AdmissionDischarge 29 1 5 34 0.97 0.85 0.91
ClinicalDept 289 31 37 326 0.90 0.89 0.89
MedicalDevice 253 97 79 332 0.72 0.76 0.74
macro_avg 571 129 121 692 0.86 0.83 0.85
micro_avg 571 129 121 692 0.82 0.83 0.82
```
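The per-label scores above follow the standard definitions; as a quick sketch, precision, recall, and F1 derive from the tp/fp/fn counts like this (using the AdmissionDischarge row as input):

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# AdmissionDischarge row: tp=29, fp=1, fn=5
p, r, f = prf(29, 1, 5)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.97 0.85 0.91, matching the table
```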
---
layout: model
title: Arabic Electra Embeddings (from aubmindlab)
author: John Snow Labs
name: electra_embeddings_araelectra_base_generator
date: 2022-05-17
tags: [ar, open_source, electra, embeddings]
task: Embeddings
language: ar
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `araelectra-base-generator` is an Arabic model originally trained by `aubmindlab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_araelectra_base_generator_ar_3.4.4_3.0_1652786188141.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_araelectra_base_generator_ar_3.4.4_3.0_1652786188141.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("electra_embeddings_araelectra_base_generator","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("electra_embeddings_araelectra_base_generator","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("أنا أحب الشرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_embeddings_araelectra_base_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|ar|
|Size:|222.7 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/aubmindlab/araelectra-base-generator
- https://arxiv.org/pdf/1406.2661.pdf
- https://arxiv.org/abs/2012.15516
- https://archive.org/details/arwiki-20190201
- https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4
- https://www.aclweb.org/anthology/W19-4619
- https://sites.aub.edu.lb/mindlab/
- https://www.yakshof.com/#/
- https://www.behance.net/rahalhabib
- https://www.linkedin.com/in/wissam-antoun-622142b4/
- https://twitter.com/wissam_antoun
- https://github.com/WissamAntoun
- https://www.linkedin.com/in/fadybaly/
- https://twitter.com/fadybaly
- https://github.com/fadybaly
---
layout: model
title: English DistilBertForTokenClassification Cased model (from Neurona)
author: John Snow Labs
name: distilbert_token_classifier_cpener_test
date: 2023-03-14
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cpener-test` is an English model originally trained by `Neurona`.
## Predicted Entities
`cpe_vendor`, `cpe_version`, `cpe_product`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_cpener_test_en_4.3.1_3.0_1678783066417.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_cpener_test_en_4.3.1_3.0_1678783066417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_cpener_test","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_cpener_test","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_cpener_test|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/Neurona/cpener-test
---
layout: model
title: Legal Warranties Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_warranties_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, warranties, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset is aimed at contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Warranties` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.
If you have big legal documents, and you want to look for clauses, we recommend you to split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Warranties`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_warranties_bert_en_1.0.0_3.0_1678049972315.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_warranties_bert_en_1.0.0_3.0_1678049972315.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+------------+
|      result|
+------------+
|[Warranties]|
|     [Other]|
|     [Other]|
|[Warranties]|
+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_warranties_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Other 0.86 0.87 0.86 83
Warranties 0.81 0.80 0.80 59
accuracy - - 0.84 142
macro-avg 0.83 0.83 0.83 142
weighted-avg 0.84 0.84 0.84 142
```
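The `weighted-avg` row in the table above can be reproduced from the per-class F1 scores and supports; a quick sanity check in plain Python (small differences from the reported 0.84 come from the inputs already being rounded):

```python
# Per-class (f1, support) pairs from the benchmarking table above.
scores = {"Other": (0.86, 83), "Warranties": (0.80, 59)}

total = sum(support for _, support in scores.values())
weighted_f1 = sum(f1 * support for f1, support in scores.values()) / total

# Agrees with the reported weighted-avg F1 of 0.84 up to rounding.
assert abs(weighted_f1 - 0.84) < 0.01
```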
---
layout: model
title: Modern Greek (1453-) asr_xlsr_53_wav2vec_greek TFWav2Vec2ForCTC from harshit345
author: John Snow Labs
name: asr_xlsr_53_wav2vec_greek
date: 2022-09-25
tags: [wav2vec2, el, audio, open_source, asr]
task: Automatic Speech Recognition
language: el
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_53_wav2vec_greek` is a Modern Greek (1453-) model originally trained by harshit345.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_xlsr_53_wav2vec_greek_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xlsr_53_wav2vec_greek_el_4.2.0_3.0_1664109101429.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xlsr_53_wav2vec_greek_el_4.2.0_3.0_1664109101429.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_xlsr_53_wav2vec_greek", "el")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_xlsr_53_wav2vec_greek", "el")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
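Under the hood, Wav2Vec2ForCTC emits per-frame token probabilities that are decoded with CTC. A minimal greedy decoder, which collapses consecutive repeats and then drops the blank symbol, can be sketched as follows; the blank symbol and the frame sequence are illustrative, not the model's real vocabulary:

```python
BLANK = "_"  # illustrative stand-in for the CTC blank token

def ctc_greedy_decode(frames):
    """Collapse consecutive repeated tokens, then remove blanks."""
    out = []
    prev = None
    for tok in frames:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return "".join(out)

# Repeated characters survive only when separated by a blank frame:
assert ctc_greedy_decode(list("hh_ee_ll_llo")) == "hello"
```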
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_xlsr_53_wav2vec_greek|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|el|
|Size:|1.2 GB|
---
layout: model
title: Google T5 (Text-To-Text Transfer Transformer) Base
author: John Snow Labs
name: t5_base
date: 2021-01-08
task: [Question Answering, Summarization, Translation]
language: en
nav_key: models
edition: Spark NLP 2.7.1
spark_version: 2.4
tags: [open_source, t5, summarization, translation, en, seq2seq]
supported: true
recommended: true
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The T5 transformer model described in the seminal paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". This model can perform a variety of tasks, such as text summarization, question answering, and translation. More details about using the model can be found in the paper (https://arxiv.org/pdf/1910.10683.pdf).
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/T5TRANSFORMER/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5TRANSFORMER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_en_2.7.1_2.4_1610133506835.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_en_2.7.1_2.4_1610133506835.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Either set one of the following task prefixes with `setTask`, or include it inline at the start of your input:
- summarize:
- translate English to German:
- translate English to French:
- stsb sentence1: Big news. sentence2: No idea.
The full list of tasks is in the Appendix of the paper: https://arxiv.org/pdf/1910.10683.pdf
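Setting a task is equivalent to prepending the prefix to each input text, which a small helper makes concrete (the helper name is illustrative):

```python
def with_task(task: str, text: str) -> str:
    """Prepend a T5 task prefix to an input, mirroring the setTask behavior."""
    return f"{task} {text}"

assert with_task("summarize:", "Long article ...") == "summarize: Long article ..."
assert with_task("translate English to German:", "Hello") == "translate English to German: Hello"
```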
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("documents")
t5 = T5Transformer() \
.pretrained("t5_base") \
.setTask("summarize:")\
.setMaxOutputLength(200)\
.setInputCols(["documents"]) \
.setOutputCol("summaries")
pipeline = Pipeline().setStages([document_assembler, t5])
results = pipeline.fit(data_df).transform(data_df)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("documents")
val t5 = T5Transformer
.pretrained("t5_base")
.setTask("summarize:")
.setInputCols(Array("documents"))
.setOutputCol("summaries")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val result = pipeline.fit(dataDf).transform(dataDf)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.t5.base").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_base|
|Compatibility:|Spark NLP 2.7.1+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[t5]|
|Language:|en|
## Data Source
https://huggingface.co/t5-base
---
layout: model
title: Voice of the Patients
author: John Snow Labs
name: ner_vop_wip
date: 2023-05-19
tags: [licensed, clinical, en, ner, vop, patient]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts healthcare-related terms from documents written in the patient's own words.
Note: The 'wip' suffix indicates that the model development is a work in progress; the model will be finalized and its performance improved in upcoming releases.
## Predicted Entities
`TestResult`, `SubstanceQuantity`, `InjuryOrPoisoning`, `Treatment`, `Modifier`, `HealthStatus`, `MedicalDevice`, `Procedure`, `Symptom`, `Frequency`, `RelationshipStatus`, `Duration`, `Allergen`, `VitalTest`, `Disease`, `Dosage`, `AdmissionDischarge`, `Test`, `Laterality`, `Route`, `DateTime`, `Drug`, `ClinicalDept`, `Vaccine`, `Form`, `Substance`, `PsychologicalCondition`, `Age`, `BodyPart`, `Employment`, `Gender`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_wip_en_4.4.2_3.0_1684508941946.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_wip_en_4.4.2_3.0_1684508941946.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_vop_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_vop_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| chunk | ner_label |
|:---------------------|:-----------------------|
| 20 year old | Age |
| girl | Gender |
| hyperthyroid | Disease |
| 1 month ago | DateTime |
| weak | Symptom |
| light | Symptom |
| panic attacks | PsychologicalCondition |
| depression | PsychologicalCondition |
| left | Laterality |
| chest | BodyPart |
| pain | Symptom |
| increased | TestResult |
| heart rate | VitalTest |
| rapidly | Modifier |
| weight loss | Symptom |
| 4 months | Duration |
| hospital | ClinicalDept |
| discharged | AdmissionDischarge |
| hospital | ClinicalDept |
| blood tests | Test |
| brain | BodyPart |
| mri | Test |
| ultrasound scan | Test |
| endoscopy | Procedure |
| doctors | Employment |
| homeopathy doctor | Employment |
| he | Gender |
| hyperthyroid | Disease |
| TSH | Test |
| 0.15 | TestResult |
| T3 | Test |
| T4 | Test |
| normal | TestResult |
| b12 deficiency | Disease |
| vitamin D deficiency | Disease |
| weekly | Frequency |
| supplement | Drug |
| vitamin D | Drug |
| 1000 mcg | Dosage |
| b12 | Drug |
| daily | Frequency |
| homeopathy medicine | Drug |
| 40 days | Duration |
| after 30 days | DateTime |
| TSH | Test |
| 0.5 | TestResult |
| now | DateTime |
| weakness | Symptom |
| depression | PsychologicalCondition |
| last week | DateTime |
| rapid heartrate | Symptom |
| allopathy medicine | Drug |
| homeopathy | Treatment |
| thyroid | BodyPart |
| allopathy | Treatment |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_wip|
|Compatibility:|Healthcare NLP 4.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.9 MB|
|Dependencies:|embeddings_clinical|
## References
In-house annotated health-related text in colloquial language.
## Sample text from the training dataset
"Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you."
## Benchmarking
```bash
label tp fp fn total precision recall f1
TestResult 353 98 171 524 0.78 0.67 0.72
SubstanceQuantity 60 20 25 85 0.75 0.71 0.73
InjuryOrPoisoning 122 37 54 176 0.77 0.69 0.73
Treatment 150 30 78 228 0.83 0.66 0.74
Modifier 817 214 322 1139 0.79 0.72 0.75
HealthStatus 80 24 27 107 0.77 0.75 0.76
MedicalDevice 250 71 82 332 0.78 0.75 0.77
Procedure 576 156 129 705 0.79 0.82 0.80
Symptom 3831 858 744 4575 0.82 0.84 0.83
Frequency 865 147 214 1079 0.85 0.80 0.83
RelationshipStatus 19 2 5 24 0.90 0.79 0.84
Duration 1845 244 465 2310 0.88 0.80 0.84
Allergen 38 4 8 46 0.90 0.83 0.86
VitalTest 143 16 29 172 0.90 0.83 0.86
Disease 1745 296 270 2015 0.85 0.87 0.86
Dosage 348 48 64 412 0.88 0.84 0.86
AdmissionDischarge 29 4 5 34 0.88 0.85 0.87
Test 1064 136 144 1208 0.89 0.88 0.88
Laterality 542 68 86 628 0.89 0.86 0.88
Route 42 5 6 48 0.89 0.88 0.88
DateTime 4075 706 327 4402 0.85 0.93 0.89
Drug 1323 196 117 1440 0.87 0.92 0.89
ClinicalDept 280 25 46 326 0.92 0.86 0.89
Vaccine 37 4 5 42 0.90 0.88 0.89
Form 252 34 14 266 0.88 0.95 0.91
Substance 398 58 23 421 0.87 0.95 0.91
PsychologicalCondition 411 42 33 444 0.91 0.93 0.92
Age 529 44 53 582 0.92 0.91 0.92
BodyPart 2730 224 170 2900 0.92 0.94 0.93
Employment 1168 37 75 1243 0.97 0.94 0.95
Gender 1292 21 25 1317 0.98 0.98 0.98
macro_avg 25414 3869 3816 29230 0.86 0.84 0.85
micro_avg 25414 3869 3816 29230 0.87 0.87 0.87
```
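The `micro_avg` row in the table above follows directly from the pooled tp/fp/fn counts:

```python
# Pooled counts from the micro_avg row of the benchmarking table.
tp, fp, fn = 25414, 3869, 3816

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

assert round(precision, 2) == 0.87
assert round(recall, 2) == 0.87
assert round(f1, 2) == 0.87
```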
---
layout: model
title: Medical Spell Checker
author: John Snow Labs
name: spellcheck_clinical
date: 2022-04-14
tags: [spellcheck, medical, medical_spell_checker, spell_checker, spelling_corrector, en, licensed, clinical]
task: Spell Check
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 2.4
supported: true
annotator: SpellCheckModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Contextual Spell Checker is a sequence-to-sequence model that detects and corrects spelling errors in your medical input text. It is based on a Levenshtein Automaton for generating candidate corrections and a Neural Language Model for ranking them. This model has been trained on a dataset containing data from different sources: MTSamples, i2b2 clinical notes, and several specific medical corpora. The model comes fully pretrained and ready to use; however, you can still customize it without retraining from scratch by providing custom definitions for the word classes the model has been trained on, namely Dates, Numbers, Ages, Units, and Medications. This model is trained for PySpark 2.4.x users with Spark NLP 3.4.1.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_en_3.4.1_2.4_1649926082521.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_en_3.4.1_2.4_1649926082521.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")\
.setContextChars(["*", "-", "“", "(", "[", "\n", ".","\"", "”", ",", "?", ")", "]", "!", ";", ":", "'s", "’s"])
spellModel = ContextSpellCheckerModel\
.pretrained('spellcheck_clinical', 'en', 'clinical/models')\
.setInputCols("token")\
.setOutputCol("checked")
pipeline = Pipeline(stages = [
documentAssembler,
tokenizer,
spellModel])
light_pipeline = LightPipeline(pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
example = ["Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.",
"With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.",
"Abdomen is sort, nontender, and nonintended.",
"Patient not showing pain or any wealth problems.",
"No cute distress"]
result = light_pipeline.annotate(example)
```
```scala
val assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
.setContextChars(Array("*", "-", "“", "(", "[", "\n", ".","\"", "”", ",", "?", ")", "]", "!", ";", ":", "'s", "’s"))
val spellChecker = ContextSpellCheckerModel.
pretrained("spellcheck_clinical", "en", "clinical/models").
setInputCols("token").
setOutputCol("checked")
val pipeline = new Pipeline().setStages(Array(
assembler,
tokenizer,
spellChecker))
val light_pipeline = new LightPipeline(pipeline.fit(Seq("").toDF("text")))
val text = Array("Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.",
"With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.",
"Abdomen is sort, nontender, and nonintended.",
"Patient not showing pain or any wealth problems.",
"No cute distress")
val result = light_pipeline.annotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.spell.clinical").predict("""Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.""")
```
## Results
```bash
[{'checked': ['With','the','cell','of','physical','therapy','the','patient','was','ambulated','and','on','postoperative',',','the','patient','tolerating','a','post','surgical','soft','diet','.'],
'document': ['Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.'],
'token': ['Witth','the','hell','of','phisical','terapy','the','patient','was','imbulated','and','on','postoperative',',','the','impatient','tolerating','a','post','curgical','soft','diet','.']},
{'checked': ['With','pain','well','controlled','on','oral','pain','medications',',','she','was','discharged','to','rehabilitation','facility','.'],
'document': ['With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.'],
'token': ['With','paint','wel','controlled','on','orall','pain','medications',',','she','was','discharged','too','reihabilitation','facilitay','.']},
{'checked': ['Abdomen','is','soft',',','nontender',',','and','nondistended','.'],
'document': ['Abdomen is sort, nontender, and nonintended.'],
'token': ['Abdomen','is','sort',',','nontender',',','and','nonintended','.']},
{'checked': ['Patient','not','showing','pain','or','any','health','problems','.'],
'document': ['Patient not showing pain or any wealth problems.'],
'token': ['Patient','not','showing','pain','or','any','wealth','problems','.']},
{'checked': ['No', 'acute', 'distress'],
'document': ['No cute distress'],
'token': ['No', 'cute', 'distress']}]
```
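Candidate generation in the spell checker relies on edit distance. A compact Levenshtein implementation (a conceptual sketch, not the model's internal automaton) confirms that the corrections shown above are only one edit away from the misspellings:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Misspelling -> correction pairs taken from the Results section above.
assert levenshtein("phisical", "physical") == 1
assert levenshtein("terapy", "therapy") == 1
assert levenshtein("imbulated", "ambulated") == 1
```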
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|spellcheck_clinical|
|Compatibility:|Healthcare NLP 3.4.1|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[corrected]|
|Language:|en|
|Size:|141.2 MB|
## References
MTSamples, i2b2 clinical notes, and several specific medical corpora.
---
layout: model
title: English DistilBertForQuestionAnswering model (from kaggleodin)
author: John Snow Labs
name: distilbert_qa_kaggleodin_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `kaggleodin`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_kaggleodin_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725640096.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_kaggleodin_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725640096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kaggleodin_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kaggleodin_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
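Extractive QA models such as this one predict start and end positions inside the context; the span-selection step itself is just a slice. The offsets below are illustrative, not actual model output:

```python
context = "My name is Clara and I live in Berkeley."

# Hypothetical start/end character offsets predicted by the model.
start, end = 11, 16
answer = context[start:end]

assert answer == "Clara"
```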
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_kaggleodin").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_kaggleodin_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/kaggleodin/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77 TFWav2Vec2ForCTC from emeson77
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77` is an English model originally trained by emeson77.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use `pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77_en_4.2.0_3.0_1664037259824.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77_en_4.2.0_3.0_1664037259824.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from mrm8488)
author: John Snow Labs
name: t5_base_finetuned_break_data
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-finetuned-break_data` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_break_data_en_4.3.0_3.0_1675108822999.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_break_data_en_4.3.0_3.0_1675108822999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_base_finetuned_break_data","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_base_finetuned_break_data","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_base_finetuned_break_data|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|878.4 MB|
## References
- https://huggingface.co/mrm8488/t5-base-finetuned-break_data
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/pdf/1910.10683.pdf
- https://i.imgur.com/jVFMMWR.png
- https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb
- https://twitter.com/psuraj28
- https://twitter.com/mrm8488
- https://www.linkedin.com/in/manuel-romero-cs/
---
layout: model
title: English asr_wav2vec2_base_checkpoint_6 TFWav2Vec2ForCTC from jiobiala24
author: John Snow Labs
name: asr_wav2vec2_base_checkpoint_6
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_checkpoint_6` is an English model originally trained by jiobiala24.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_wav2vec2_base_checkpoint_6_gpu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_checkpoint_6_en_4.2.0_3.0_1664020777200.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_checkpoint_6_en_4.2.0_3.0_1664020777200.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_base_checkpoint_6", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_base_checkpoint_6", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
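Both snippets above assume an `audioDf` DataFrame whose `audio_content` column holds the raw waveform as an array of floats. As a minimal sketch (a hypothetical helper, not part of Spark NLP), a 16-bit PCM WAV file can be decoded into such a float array with the Python standard library:

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit PCM WAV file into a list of floats in [-1.0, 1.0],
    the kind of array the audio_content column is expected to hold."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wav.readframes(wav.getnframes())
    # "<%dh" unpacks little-endian signed 16-bit samples
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```

The resulting list can then be wrapped into the DataFrame used above, e.g. `audioDf = spark.createDataFrame([(wav_to_floats("sample.wav"),)], ["audio_content"])`.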
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_checkpoint_6|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|349.2 MB|
---
layout: model
title: English asr_wav2vec2_murad_with_some_data TFWav2Vec2ForCTC from MBMMurad
author: John Snow Labs
name: pipeline_asr_wav2vec2_murad_with_some_data
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_murad_with_some_data` is an English model originally trained by MBMMurad.
NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_wav2vec2_murad_with_some_data_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_murad_with_some_data_en_4.2.0_3.0_1664111468486.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_murad_with_some_data_en_4.2.0_3.0_1664111468486.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_murad_with_some_data', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_murad_with_some_data", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_murad_with_some_data|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Fast Neural Machine Translation Model from English to Baltic Languages
author: John Snow Labs
name: opus_mt_en_bat
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, bat, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `en`
- target languages: `bat`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_bat_xx_2.7.0_2.4_1609166860296.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_bat_xx_2.7.0_2.4_1609166860296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_bat", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_bat", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.bat').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_bat|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForQuestionAnswering model (from vuiseng9)
author: John Snow Labs
name: bert_qa_bert_base_squadv1
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-squadv1` is an English model originally trained by `vuiseng9`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_squadv1_en_4.0.0_3.0_1654180625513.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_squadv1_en_4.0.0_3.0_1654180625513.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_squadv1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_squadv1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base.by_vuiseng9").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
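The nlu one-liner above packs the question and context into a single string separated by `|||`. A tiny sketch of that convention (`split_question_context` is a hypothetical helper written for illustration, not part of the nlu API):

```python
def split_question_context(s, sep="|||"):
    """Split a 'question|||context' string into a (question, context) pair."""
    question, _, context = s.partition(sep)
    return question.strip(), context.strip()
```

For example, `split_question_context("What's my name?|||My name is Clara and I live in Berkeley.")` yields the question/context pair fed to the model.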
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_squadv1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/vuiseng9/bert-base-squadv1
---
layout: model
title: Stopwords Remover for Polish language (381 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, pl, open_source]
task: Stop Words Removal
language: pl
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_pl_3.4.1_3.0_1646673188092.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_pl_3.4.1_3.0_1646673188092.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","pl") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Nie jesteś lepszy ode mnie"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","pl")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("Nie jesteś lepszy ode mnie").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("pl.stopwords").predict("""Nie jesteś lepszy ode mnie""")
```
## Results
```bash
+---------------------+
|result |
+---------------------+
|[jesteś, lepszy, ode]|
+---------------------+
```
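What StopWordsCleaner does here can be mimicked in plain Python. A minimal sketch, using a tiny hand-picked subset of stopwords rather than the model's full 381-entry list:

```python
# Tiny illustrative subset; the pretrained model ships 381 Polish entries.
PL_STOPWORDS = {"nie", "mnie"}

def clean_tokens(tokens):
    """Drop tokens that appear (case-insensitively) in the stopword set,
    mirroring what StopWordsCleaner does in the pipeline above."""
    return [t for t in tokens if t.lower() not in PL_STOPWORDS]

clean_tokens(["Nie", "jesteś", "lepszy", "ode", "mnie"])
# → ['jesteś', 'lepszy', 'ode'], matching the Results above
```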
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|pl|
|Size:|2.4 KB|
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from chiendvhust)
author: John Snow Labs
name: roberta_qa_chiendvhust_base_finetuned_squad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad` is an English model originally trained by `chiendvhust`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_chiendvhust_base_finetuned_squad_en_4.3.0_3.0_1674217182008.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_chiendvhust_base_finetuned_squad_en_4.3.0_3.0_1674217182008.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_chiendvhust_base_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_chiendvhust_base_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_chiendvhust_base_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|457.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/chiendvhust/roberta-base-finetuned-squad
---
layout: model
title: Translate English to Sango Pipeline
author: John Snow Labs
name: translate_en_sg
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, sg, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `sg`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_sg_xx_2.7.0_2.4_1609686246000.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_sg_xx_2.7.0_2.4_1609686246000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_sg", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_sg", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.sg').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_sg|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Word2Vec Embeddings in Afrikaans (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-14
tags: [cc, embeddings, fastText, word2vec, af, open_source]
task: Embeddings
language: af
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_af_3.4.1_3.0_1647281785039.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_af_3.4.1_3.0_1647281785039.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","af") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ek is lief vir Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","af")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ek is lief vir Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("af.embed.w2v_cc_300d").predict("""Ek is lief vir Spark NLP""")
```
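Once each token carries a 300-dimensional vector, tokens are commonly compared via cosine similarity. A self-contained sketch of that measure (in real use the vectors would come from the `embeddings` column produced above; the ones here are placeholders):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors:
    dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# vectors pointing the same way score 1.0; orthogonal vectors score 0.0
```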
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|af|
|Size:|515.9 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from hogger32)
author: John Snow Labs
name: distilbert_qa_hogger32_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `hogger32`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hogger32_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771247105.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hogger32_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771247105.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hogger32_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hogger32_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_hogger32_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/hogger32/distilbert-base-uncased-finetuned-squad
---
layout: model
title: RCT Binary Classifier (USE) Pipeline
author: John Snow Labs
name: rct_binary_classifier_use_pipeline
date: 2022-06-06
tags: [licensed, rct, clinical, classifier, en]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 3.4.2
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [rct_binary_classifier_use](https://nlp.johnsnowlabs.com/2022/05/27/rct_binary_classifier_use_en_3_0.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_RCT/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLASSIFICATION_RCT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rct_binary_classifier_use_pipeline_en_3.4.2_3.0_1654517524887.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rct_binary_classifier_use_pipeline_en_3.4.2_3.0_1654517524887.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("rct_binary_classifier_use_pipeline", "en", "clinical/models")
result = pipeline.annotate("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("rct_binary_classifier_use_pipeline", "en", "clinical/models")
val result = pipeline.annotate("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.rct_binary_use.pipeline").predict("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """)
```
## Results
```bash
+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|rct |text |
+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|true|Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. |
+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|rct_binary_classifier_use_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|991.2 MB|
## Included Models
- DocumentAssembler
- UniversalSentenceEncoder
- ClassifierDLModel
---
layout: model
title: Spanish BertForQuestionAnswering model (from MMG)
author: John Snow Labs
name: bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad
date: 2022-06-02
tags: [es, open_source, question_answering, bert]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-cased-finetuned-sqac-finetuned-squad` is a Spanish model originally trained by `MMG`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad_es_4.0.0_3.0_1654180513805.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad_es_4.0.0_3.0_1654180513805.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad","es") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.squad_sqac.bert.base_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|410.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/MMG/bert-base-spanish-wwm-cased-finetuned-sqac-finetuned-squad
---
layout: model
title: Bulgarian Lemmatizer
author: John Snow Labs
name: lemma
date: 2020-05-05 11:14:00 +0800
task: Lemmatization
language: bg
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [lemmatizer, bg]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tenses of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes the context surrounding a word into consideration to determine which root is correct when the word form alone is ambiguous.
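As a toy illustration of why context matters (a hypothetical lookup table, not the model's actual data or mechanism), the same Bulgarian surface form can resolve to different lemmas depending on its part of speech:

```python
# Hypothetical form-to-lemma table: the ambiguous form "крал" is either
# the noun "king" or a participle of "крада" (to steal).
lemma_table = {
    ("крал", "VERB"): "крада",  # participle of "to steal"
    ("крал", "NOUN"): "крал",   # "king" is already a root form
    ("е", "AUX"): "съм",        # "is" -> "to be"
}

def lemmatize(token, pos):
    """Return the lemma for (token, POS), falling back to the surface form."""
    return lemma_table.get((token, pos), token)

print(lemmatize("е", "AUX"))      # съм
print(lemmatize("крал", "VERB"))  # крада
```

This mirrors the Results block below, where "е" is lemmatized to "съм" and "крал" to "крада" given their sentence context.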
{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_bg_2.5.0_2.4_1588666297763.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_bg_2.5.0_2.4_1588666297763.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
lemmatizer = LemmatizerModel.pretrained("lemma", "bg") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Освен че е крал на север, Джон Сноу е английски лекар и лидер в развитието на анестезия и медицинска хигиена.")
```
```scala
...
val lemmatizer = LemmatizerModel.pretrained("lemma", "bg")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer))
val data = Seq("Освен че е крал на север, Джон Сноу е английски лекар и лидер в развитието на анестезия и медицинска хигиена.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Освен че е крал на север, Джон Сноу е английски лекар и лидер в развитието на анестезия и медицинска хигиена."""]
lemma_df = nlu.load('bg.lemma').predict(text, output_level='document')
lemma_df.lemma.values[0]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=4, result='Освен', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=6, end=7, result='че', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=9, end=9, result='съм', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=11, end=14, result='крада', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=16, end=17, result='на', metadata={'sentence': '0'}, embeddings=[]),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma|
|Type:|lemmatizer|
|Compatibility:|Spark NLP 2.5.0+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[lemma]|
|Language:|bg|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from carolgao66)
author: John Snow Labs
name: distilbert_qa_carolgao66_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `carolgao66`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_carolgao66_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770348185.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_carolgao66_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770348185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_carolgao66_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_carolgao66_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_carolgao66_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/carolgao66/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Fast Neural Machine Translation Model from Indo-Iranian Languages to English
author: John Snow Labs
name: opus_mt_iir_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, iir, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `iir`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_iir_en_xx_2.7.0_2.4_1609163487551.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_iir_en_xx_2.7.0_2.4_1609163487551.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_iir_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_iir_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.iir.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_iir_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Dutch RoBERTa Embeddings
author: John Snow Labs
name: roberta_embeddings_robbert_v2_dutch_base
date: 2022-04-14
tags: [roberta, embeddings, nl, open_source]
task: Embeddings
language: nl
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `robbert-v2-dutch-base` is a Dutch model originally trained by `pdelobelle`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robbert_v2_dutch_base_nl_3.4.2_3.0_1649949003731.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robbert_v2_dutch_base_nl_3.4.2_3.0_1649949003731.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robbert_v2_dutch_base","nl") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Ik hou van vonk nlp"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robbert_v2_dutch_base","nl")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Ik hou van vonk nlp").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("nl.embed.robbert_v2_dutch_base").predict("""Ik hou van vonk nlp""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","eu") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Txinparta nlp maite dut"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","eu")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Txinparta nlp maite dut").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("eu.embed.w2v_cc_300d").predict("""Txinparta nlp maite dut""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|eu|
|Size:|1.1 GB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Clean patterns pipeline for English
author: John Snow Labs
name: clean_pattern
date: 2021-03-24
tags: [open_source, english, clean_pattern, pipeline, en]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: en
nav_key: models
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The clean_pattern pipeline is a pretrained pipeline that performs basic text processing steps and recognizes entities. It covers most of the common text processing tasks on your dataframe.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/clean_pattern_en_3.0.0_3.0_1616544446008.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/clean_pattern_en_3.0.0_3.0_1616544446008.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('clean_pattern', lang = 'en')
annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("clean_pattern", lang = "en")
val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs ! "]
result_df = nlu.load('en.clean.pattern').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | normal |
|---:|:-----------|:-----------|:----------|:----------|
| 0 | ['Hello'] | ['Hello'] | ['Hello'] | ['Hello'] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|clean_pattern|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
---
layout: model
title: Detect Assertion Status (assertion_wip)
author: John Snow Labs
name: jsl_assertion_wip
date: 2021-01-18
task: Assertion Status
language: en
nav_key: models
edition: Healthcare NLP 2.7.0
spark_version: 2.4
tags: [clinical, licensed, assertion, en, ner]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The deep neural network architecture for assertion status detection in Spark NLP is based on a BiLSTM framework and is a modified version of the architecture proposed by Fancellu et al. (Fancellu, Lopez, and Webber 2016). Its goal is to classify the assertions made on given medical concepts as present, absent, or possible in the patient; conditionally present in the patient under certain circumstances; hypothetically present in the patient at some future point; or mentioned in the patient report but associated with someone else (Uzuner et al. 2011).
{:.h2_title}
## Predicted Entities
`Present`, `Absent`, `Possible`, `Planned`, `Someoneelse`, `Past`, `Family`, `None`, `Hypothetical`.
{:.btn-box}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_assertion_wip_en_2.6.1_2.4_1606860510166.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_assertion_wip_en_2.6.1_2.4_1606860510166.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter, AssertionDLModel.
{% include programmingLanguageSelectScalaPython.html %}
```python
...
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
clinical_assertion = AssertionDLModel.pretrained("jsl_assertion_wip", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = model.transform(spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."]], ["text"])
```
```scala
...
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val clinical_assertion = AssertionDLModel.pretrained("jsl_assertion_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings"))
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion))
val data = Seq("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
The output is a dataframe with a sentence per row and an ``"assertion"`` column containing all of the assertion labels in the sentence. The assertion column also contains assertion character indices, and other metadata. To get only the entity chunks and assertion labels, without the metadata, select ``"ner_chunk.result"`` and ``"assertion.result"`` from your output dataframe.
```bash
+-----------------------------------------+-----+---+----------------------------+-------+---------+
|chunk |begin|end|ner_label |sent_id|assertion|
+-----------------------------------------+-----+---+----------------------------+-------+---------+
|21-day-old |17 |26 |Age |0 |Family |
|Caucasian |28 |36 |Race_Ethnicity |0 |Family |
|male |38 |41 |Gender |0 |Family |
|for 2 days |48 |57 |Duration |0 |Family |
|congestion |62 |71 |Symptom |0 |Present |
|mom |75 |77 |Gender |0 |Family |
|yellow |99 |104|Modifier |0 |Family |
|discharge |106 |114|Symptom |0 |Family |
|nares |135 |139|External_body_part_or_region|0 |Family |
|she |147 |149|Gender |0 |Family |
|mild |168 |171|Modifier |0 |Family |
|problems with his breathing while feeding|173 |213|Symptom |0 |Present |
|perioral cyanosis |237 |253|Symptom |0 |Absent |
|retractions |258 |268|Symptom |0 |Absent |
|One day ago |272 |282|RelativeDate |1 |Family |
|mom |285 |287|Gender |1 |Family |
|Tylenol |345 |351|Drug_BrandName |1 |Family |
|Baby |354 |357|Age |2 |Family |
|decreased p.o. intake |377 |397|Symptom |2 |Family |
|His |400 |402|Gender |3 |Family |
+-----------------------------------------+-----+---+----------------------------+-------+---------+
```
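The metadata-free selection described above can be sketched in plain Python; the lists below are stand-ins for collected `ner_chunk.result` and `assertion.result` values, not actual pipeline output:

```python
# Stand-in values for the collected "ner_chunk.result" and "assertion.result" columns.
ner_chunk_result = ["congestion", "perioral cyanosis", "retractions"]
assertion_result = ["Present", "Absent", "Absent"]

# Pair each recognized entity chunk with its assertion label, dropping all metadata.
pairs = list(zip(ner_chunk_result, assertion_result))
print(pairs)  # [('congestion', 'Present'), ('perioral cyanosis', 'Absent'), ('retractions', 'Absent')]
```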
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|jsl_assertion_wip|
|Type:|ner|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence, ner_chunk, embeddings]|
|Output Labels:|[assertion]|
|Language:|[en]|
|Case sensitive:|false|
{:.h2_title}
## Data Source
Trained on 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text with 'embeddings_clinical'.
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
{:.h2_title}
## Benchmarking
```bash
label prec rec f1
Absent 0.970 0.943 0.956
Someoneelse 0.868 0.775 0.819
Planned 0.721 0.754 0.737
Possible 0.852 0.884 0.868
Past 0.811 0.823 0.817
Present 0.833 0.866 0.849
Family 0.872 0.921 0.896
None 0.609 0.359 0.452
Hypothetical 0.722 0.810 0.763
Macro-average 0.888 0.872 0.880
Micro-average 0.908 0.908 0.908
```
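Each per-label `f1` value above is the harmonic mean of the corresponding `prec` and `rec` columns; a quick sanity check against the `Absent` row:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Absent row: prec 0.970, rec 0.943 -> f1 0.956, matching the table.
print(round(f1_score(0.970, 0.943), 3))  # 0.956
```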
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from saraks)
author: John Snow Labs
name: distilbert_qa_cuad_parties_cased_08_31_v1
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cuad-distil-parties-cased-08-31-v1` is an English model originally trained by `saraks`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_parties_cased_08_31_v1_en_4.3.0_3.0_1672766262518.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_parties_cased_08_31_v1_en_4.3.0_3.0_1672766262518.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_parties_cased_08_31_v1","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_parties_cased_08_31_v1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_cuad_parties_cased_08_31_v1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/saraks/cuad-distil-parties-cased-08-31-v1
---
layout: model
title: Ganda asr_wav2vec2_xlsr_multilingual_56 TFWav2Vec2ForCTC from voidful
author: John Snow Labs
name: asr_wav2vec2_xlsr_multilingual_56
date: 2022-09-24
tags: [wav2vec2, lg, audio, open_source, asr]
task: Automatic Speech Recognition
language: lg
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_multilingual_56` is a Ganda model originally trained by voidful.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_xlsr_multilingual_56_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_multilingual_56_lg_4.2.0_3.0_1664035818043.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_multilingual_56_lg_4.2.0_3.0_1664035818043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_xlsr_multilingual_56", "lg")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_xlsr_multilingual_56", "lg")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_xlsr_multilingual_56|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|lg|
|Size:|1.2 GB|
---
layout: model
title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18)
author: John Snow Labs
name: roberta_qa_base_spanish_squades_becasincentivos2
date: 2023-01-20
tags: [es, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: es
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades-becasIncentivos2` is a Spanish model originally trained by `Evelyn18`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasincentivos2_es_4.3.0_3.0_1674218030841.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasincentivos2_es_4.3.0_3.0_1674218030841.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasincentivos2","es")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasincentivos2","es")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_spanish_squades_becasincentivos2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|459.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Evelyn18/roberta-base-spanish-squades-becasIncentivos2
---
layout: model
title: English RobertaForQuestionAnswering (from sunitha)
author: John Snow Labs
name: roberta_qa_roberta_customds_finetune
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-customds-finetune` is an English model originally trained by `sunitha`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_customds_finetune_en_4.0.0_3.0_1655735713085.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_customds_finetune_en_4.0.0_3.0_1655735713085.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_customds_finetune","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_customds_finetune","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.roberta.by_sunitha").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_customds_finetune|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/sunitha/roberta-customds-finetune
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from abhinavkulkarni)
author: John Snow Labs
name: distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `abhinavkulkarni`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769656788.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769656788.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/abhinavkulkarni/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Translate Waray to English Pipeline
author: John Snow Labs
name: translate_war_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, war, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `war`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_war_en_xx_2.7.0_2.4_1609700727408.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_war_en_xx_2.7.0_2.4_1609700727408.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_war_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_war_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.war.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_war_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Legal The merger Clause Binary Classifier
author: John Snow Labs
name: legclf_the_merger_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `the-merger` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `the-merger`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_the_merger_clause_en_1.0.0_3.2_1660123103140.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_the_merger_clause_en_1.0.0_3.2_1660123103140.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
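{% include programmingLanguageSelectScalaPythonNLU.html %}
This card is missing its example code. The sketch below follows the usual Legal NLP clause-classification pattern seen in sibling cards (DocumentAssembler → sentence embeddings → ClassifierDLModel). It is a minimal sketch only: it assumes a licensed Spark NLP for Legal environment with an active `spark` session, and the `sent_bert_base_cased` embeddings stage is an assumption — substitute the embeddings this classifier was trained with.
```python
# Sketch only: assumes a licensed Spark NLP for Legal environment and an
# active `spark` session; the sentence-embeddings model is an assumption.
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")
doc_classifier = ClassifierDLModel.pretrained("legclf_the_merger_clause", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("category")
pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])
data = spark.createDataFrame([["YOUR LEGAL CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("category.result").show(truncate=False)
```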
## Results
```bash
+------------+
|result      |
+------------+
|[the-merger]|
|[other]     |
|[other]     |
|[the-merger]|
+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_the_merger_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet, and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.98 1.00 0.99 98
the-merger 1.00 0.95 0.97 38
accuracy - - 0.99 136
macro-avg 0.99 0.97 0.98 136
weighted-avg 0.99 0.99 0.99 136
```
---
layout: model
title: Pipeline to Detect Clinical Events
author: John Snow Labs
name: ner_events_healthcare_pipeline
date: 2022-03-22
tags: [licensed, ner, clinical, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_events_healthcare](https://nlp.johnsnowlabs.com/2021/04/01/ner_events_healthcare_en.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_EVENTS_CLINICAL/){:.button.button-orange.button-orange-trans.arr.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_EVENTS_CLINICAL.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_healthcare_pipeline_en_3.4.1_3.0_1647943997404.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_healthcare_pipeline_en_3.4.1_3.0_1647943997404.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_events_healthcare_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("The patient presented to the emergency room last evening")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_events_healthcare_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("The patient presented to the emergency room last evening")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.healthcare_events.pipeline").predict("""The patient presented to the emergency room last evening""")
```
## Results
```bash
+------------------+-------------+
|chunks |entities |
+------------------+-------------+
|presented |EVIDENTIAL |
|the emergency room|CLINICAL_DEPT|
|last evening |DATE |
+------------------+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_events_healthcare_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|513.6 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: Translate English to Chuukese Pipeline
author: John Snow Labs
name: translate_en_chk
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, chk, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `chk`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_chk_xx_2.7.0_2.4_1609689317048.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_chk_xx_2.7.0_2.4_1609689317048.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_chk", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_chk", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.chk').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_chk|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForQuestionAnswering model (from dmis-lab)
author: John Snow Labs
name: bert_qa_biobert_base_cased_v1.1_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert-base-cased-v1.1-squad` is an English model originally trained by `dmis-lab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_base_cased_v1.1_squad_en_4.0.0_3.0_1654185575469.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_base_cased_v1.1_squad_en_4.0.0_3.0_1654185575469.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_base_cased_v1.1_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_biobert_base_cased_v1.1_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.biobert.base_cased.by_dmis-lab").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_biobert_base_cased_v1.1_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|403.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/dmis-lab/biobert-base-cased-v1.1-squad
---
layout: model
title: Korean Lemmatizer
author: John Snow Labs
name: lemma
date: 2021-01-15
task: Lemmatization
language: ko
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [ko, lemmatizer, open_source]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/TEXT_PREPROCESSING/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_PREPROCESSING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_ko_2.7.0_2.4_1610747055280.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_ko_2.7.0_2.4_1610747055280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
word_segmenter = WordSegmenterModel.pretrained('wordseg_kaist_ud', 'ko')\
.setInputCols("document")\
.setOutputCol("token")
lemmatizer = LemmatizerModel.pretrained("lemma", "ko") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
nlp_pipeline = Pipeline(stages=[document_assembler, word_segmenter , lemmatizer])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
results = light_pipeline.fullAnnotate(["이렇게되면이러한인간형을다투어본받으려할것이틀림없다."])
```
```scala
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_kaist_ud", "ko")
.setInputCols("document")
.setOutputCol("token")
val lemmatizer = LemmatizerModel.pretrained("lemma", "ko")
.setInputCols("token")
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter , lemmatizer))
val data = Seq("이렇게되면이러한인간형을다투어본받으려할것이틀림없다.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["이렇게되면이러한인간형을다투어본받으려할것이틀림없다."]
lemma_df = nlu.load('ko.lemma').predict(text, output_level = "document")
lemma_df.lemma.values[0]
```
## Results
```bash
{'lemma': [Annotation(token, 0, 2, 이렇게, {'sentence': '0'}),
Annotation(token, 3, 4, 되+면, {'sentence': '0'}),
Annotation(token, 5, 7, 이러한+ㄴ, {'sentence': '0'}),
Annotation(token, 8, 11, 인간형+을, {'sentence': '0'}),
Annotation(token, 12, 15, 다투어본, {'sentence': '0'}),
Annotation(token, 16, 18, 받으할, {'sentence': '0'}),
Annotation(token, 18, 18, 려, {'sentence': '0'}),
Annotation(token, 20, 21, 것+이, {'sentence': '0'}),
Annotation(token, 22, 25, 틀림없+다, {'sentence': '0'}),
Annotation(token, 26, 26, ., {'sentence': '0'})]}
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|ko|
## Data Source
The model was trained on the Universal Dependencies treebank from the _Korea Advanced Institute of Science and Technology (KAIST)_.
Reference:
- Building Universal Dependency Treebanks in Korean, Jayeol Chun, Na-Rae Han, Jena D. Hwang, and Jinho D. Choi. In Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC'18, Miyazaki, Japan, 2018.
---
layout: model
title: Legal Choice of law Clause Binary Classifier (md)
author: John Snow Labs
name: legclf_choice_of_law_md
date: 2023-01-11
tags: [en, legal, classification, document, agreement, contract, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `choice-of-law` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `choice-of-law`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_choice_of_law_md_en_1.0.0_3.0_1673460245223.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_choice_of_law_md_en_1.0.0_3.0_1673460245223.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
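{% include programmingLanguageSelectScalaPythonNLU.html %}
This card is missing its example code. As a minimal sketch, assuming the standard Legal NLP document-classification pipeline used by sibling cards (DocumentAssembler → sentence embeddings → ClassifierDLModel, wired to this card's `embeddings`/`class` columns); it requires a licensed Spark NLP for Legal environment, and the `sent_bert_base_cased` embeddings stage is an assumption — substitute the embeddings this classifier was trained with.
```python
# Sketch only: assumes a licensed Spark NLP for Legal environment and an
# active `spark` session; the sentence-embeddings model is an assumption.
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols(["document"]) \
.setOutputCol("embeddings")
doc_classifier = ClassifierDLModel.pretrained("legclf_choice_of_law_md", "en", "legal/models") \
.setInputCols(["embeddings"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])
data = spark.createDataFrame([["YOUR LEGAL DOCUMENT TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("class.result").show(truncate=False)
```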
## Results
```bash
+---------------+
|result         |
+---------------+
|[choice-of-law]|
|[other]        |
|[other]        |
|[choice-of-law]|
+---------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_choice_of_law_md|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scraped from the Internet and classified in-house + SEC documents + Lawinsider categorization
## Benchmarking
```bash
precision recall f1-score support
amendments-and-waivers 1.00 0.97 0.99 35
other 0.97 1.00 0.99 39
accuracy 0.99 74
macro avg 0.99 0.99 0.99 74
weighted avg 0.99 0.99 0.99 74
```
---
layout: model
title: English T5ForConditionalGeneration Cased model (from ThomasNLG)
author: John Snow Labs
name: t5_qa_webnlg_synth
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-qa_webnlg_synth-en` is an English model originally trained by `ThomasNLG`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_qa_webnlg_synth_en_4.3.0_3.0_1675125486836.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_qa_webnlg_synth_en_4.3.0_3.0_1675125486836.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_qa_webnlg_synth","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_qa_webnlg_synth","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_qa_webnlg_synth|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|271.7 MB|
## References
- https://huggingface.co/ThomasNLG/t5-qa_webnlg_synth-en
- https://github.com/ThomasScialom/QuestEval
- https://arxiv.org/abs/2104.07555
---
layout: model
title: Legal Position and duties Clause Binary Classifier
author: John Snow Labs
name: legclf_position_and_duties_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `position-and-duties` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `position-and-duties`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_position_and_duties_clause_en_1.0.0_3.2_1660122849201.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_position_and_duties_clause_en_1.0.0_3.2_1660122849201.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+---------------------+
|               result|
+---------------------+
|[position-and-duties]|
|              [other]|
|              [other]|
|[position-and-duties]|
+---------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_position_and_duties_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents scraped from the Internet and classified in-house.
## Benchmarking
```bash
              label  precision  recall  f1-score  support
              other       1.00    0.99      0.99       92
position-and-duties       0.97    1.00      0.99       38
           accuracy          -       -      0.99      130
          macro-avg       0.99    0.99      0.99      130
       weighted-avg       0.99    0.99      0.99      130
```
---
layout: model
title: Swahili XLMRobertaForTokenClassification Base Cased model (from mbeukman)
author: John Snow Labs
name: xlmroberta_ner_base_finetuned_igbo_finetuned_ner_swahili
date: 2022-08-01
tags: [sw, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: sw
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-igbo-finetuned-ner-swahili` is a Swahili model originally trained by `mbeukman`.
## Predicted Entities
`PER`, `DATE`, `ORG`, `LOC`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_igbo_finetuned_ner_swahili_sw_4.1.0_3.0_1659354042745.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_igbo_finetuned_ner_swahili_sw_4.1.0_3.0_1659354042745.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_igbo_finetuned_ner_swahili","sw") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_igbo_finetuned_ner_swahili","sw")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_finetuned_igbo_finetuned_ner_swahili|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|sw|
|Size:|1.0 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-igbo-finetuned-ner-swahili
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://github.com/masakhane-io/masakhane-ner
---
layout: model
title: Translate English to Mossi Pipeline
author: John Snow Labs
name: translate_en_mos
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, mos, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `mos`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_mos_xx_2.7.0_2.4_1609686900824.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_mos_xx_2.7.0_2.4_1609686900824.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_mos", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_mos", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.mos').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_mos|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English asr_wav2vec2_25_1Aug_2022 TFWav2Vec2ForCTC from Roshana
author: John Snow Labs
name: pipeline_asr_wav2vec2_25_1Aug_2022
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_25_1Aug_2022` is an English model originally trained by Roshana.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_25_1Aug_2022_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_25_1Aug_2022_en_4.2.0_3.0_1664116524444.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_25_1Aug_2022_en_4.2.0_3.0_1664116524444.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_25_1Aug_2022', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_25_1Aug_2022", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_25_1Aug_2022|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from google)
author: John Snow Labs
name: t5_efficient_base_dl4
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-dl4` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dl4_en_4.3.0_3.0_1675110033833.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dl4_en_4.3.0_3.0_1675110033833.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_base_dl4","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_base_dl4","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_base_dl4|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|329.9 MB|
## References
- https://huggingface.co/google/t5-efficient-base-dl4
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Translate English to Basque (family) Pipeline
author: John Snow Labs
name: translate_en_euq
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, euq, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `euq`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_euq_xx_2.7.0_2.4_1609689947319.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_euq_xx_2.7.0_2.4_1609689947319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_euq", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_euq", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.euq').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_euq|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForQuestionAnswering model (from franklu)
author: John Snow Labs
name: bert_qa_pubmed_bert_squadv2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `pubmed_bert_squadv2` is an English model originally trained by `franklu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_pubmed_bert_squadv2_en_4.0.0_3.0_1654189059722.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_pubmed_bert_squadv2_en_4.0.0_3.0_1654189059722.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_pubmed_bert_squadv2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_pubmed_bert_squadv2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_pubmed.bert.v2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_pubmed_bert_squadv2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|408.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/franklu/pubmed_bert_squadv2
- https://rajpurkar.github.io/SQuAD-explorer/
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578 TFWav2Vec2ForCTC from doddle124578
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578` is an English model originally trained by doddle124578.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578_en_4.2.0_3.0_1664037306955.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578_en_4.2.0_3.0_1664037306955.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Oncology Pipeline for Biomarkers
author: John Snow Labs
name: oncology_biomarker_pipeline
date: 2023-03-29
tags: [licensed, pipeline, oncology, biomarker, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline includes Named-Entity Recognition, Assertion Status and Relation Extraction models to extract information from oncology texts. This pipeline focuses on entities related to biomarkers.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/oncology_biomarker_pipeline_en_4.3.2_3.2_1680112789514.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/oncology_biomarker_pipeline_en_4.3.2_3.2_1680112789514.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("oncology_biomarker_pipeline", "en", "clinical/models")
text = '''Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("oncology_biomarker_pipeline", "en", "clinical/models")
val text = "Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.oncology_biomarker.pipeline").predict("""Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_nitishkumar_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_nitishkumar_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_nitishkumar_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/NitishKumar/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Urdu Bert Embeddings (from Geotrend)
author: John Snow Labs
name: bert_embeddings_bert_base_ur_cased
date: 2022-04-11
tags: [bert, embeddings, ur, open_source]
task: Embeddings
language: ur
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-ur-cased` is an Urdu model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_ur_cased_ur_3.4.2_3.0_1649676499960.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_ur_cased_ur_3.4.2_3.0_1649676499960.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_ur_cased","ur") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["مجھے سپارک این ایل پی سے محبت ہے"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_ur_cased","ur")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("مجھے سپارک این ایل پی سے محبت ہے").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ur.embed.bert_cased").predict("""مجھے سپارک این ایل پی سے محبت ہے""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_ur_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ur|
|Size:|348.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/bert-base-ur-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: Multilingual XLMRobertaForTokenClassification Cased model (from magistermilitum)
author: John Snow Labs
name: xlmroberta_ner_roberta_multilingual_medieval
date: 2022-08-13
tags: [xx, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: xx
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-multilingual-medieval-ner` is a Multilingual model originally trained by `magistermilitum`.
## Predicted Entities
`LOC`, `L-PERS`, `PERS`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_roberta_multilingual_medieval_xx_4.1.0_3.0_1660422872636.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_roberta_multilingual_medieval_xx_4.1.0_3.0_1660422872636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_roberta_multilingual_medieval","xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_roberta_multilingual_medieval","xx")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_roberta_multilingual_medieval|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|xx|
|Size:|1.8 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/magistermilitum/roberta-multilingual-medieval-ner
---
layout: model
title: English BertForQuestionAnswering Cased model (from motiondew)
author: John Snow Labs
name: bert_qa_set_date_1_lr_2e_5_bs_32_ep_3
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-set_date_1-lr-2e-5-bs-32-ep-3` is an English model originally trained by `motiondew`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_1_lr_2e_5_bs_32_ep_3_en_4.0.0_3.0_1657188305649.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_1_lr_2e_5_bs_32_ep_3_en_4.0.0_3.0_1657188305649.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_1_lr_2e_5_bs_32_ep_3","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_1_lr_2e_5_bs_32_ep_3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_set_date_1_lr_2e_5_bs_32_ep_3|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/motiondew/bert-set_date_1-lr-2e-5-bs-32-ep-3
---
layout: model
title: Arabic ElectraForQuestionAnswering model (from aymanm419) Version-1
author: John Snow Labs
name: electra_qa_araElectra_SQUAD_ARCD
date: 2022-06-22
tags: [ar, open_source, electra, question_answering]
task: Question Answering
language: ar
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `araElectra-SQUAD-ARCD` is an Arabic model originally trained by `aymanm419`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_araElectra_SQUAD_ARCD_ar_4.0.0_3.0_1655920164320.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_araElectra_SQUAD_ARCD_ar_4.0.0_3.0_1655920164320.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_araElectra_SQUAD_ARCD","ar") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_araElectra_SQUAD_ARCD","ar")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.answer_question.squad_arcd.electra").predict("""ما هو اسمي؟|||اسمي كلارا وأنا أعيش في بيركلي.""")
```
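The nlu one-liner packs the question and its context into a single string separated by `|||`. A small helper for building such inputs, following the separator convention shown above:

```python
def qa_input(question: str, context: str, sep: str = "|||") -> str:
    """Join a question and its context with the separator used in the
    nlu snippet above."""
    return f"{question}{sep}{context}"

print(qa_input("What is my name?",
               "My name is Clara and I live in Berkeley."))
```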
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|electra_qa_araElectra_SQUAD_ARCD|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ar|
|Size:|504.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/aymanm419/araElectra-SQUAD-ARCD
---
layout: model
title: Translate English to Multiple languages Pipeline
author: John Snow Labs
name: translate_en_mul
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, mul, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `mul`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_mul_xx_2.7.0_2.4_1609689926198.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_mul_xx_2.7.0_2.4_1609689926198.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_mul", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_mul", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.mul').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_mul|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Fast Neural Machine Translation Model from English to Indo-European Languages
author: John Snow Labs
name: opus_mt_en_ine
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, ine, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `en`
- target languages: `ine`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ine_xx_2.7.0_2.4_1609164123331.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ine_xx_2.7.0_2.4_1609164123331.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_en_ine", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_ine", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ine').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ine|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: RoBERTa Large CoNLL-03 NER Pipeline
author: ahmedlone127
name: roberta_large_token_classifier_conll03_pipeline
date: 2022-06-14
tags: [open_source, ner, token_classifier, roberta, conll03, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: false
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [roberta_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/roberta_large_token_classifier_conll03_en.html) model.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/roberta_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655220223619.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/roberta_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655220223619.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("roberta_large_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")
```
```scala
val pipeline = new PretrainedPipeline("roberta_large_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|John |PERSON |
|John Snow Labs|ORG |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_large_token_classifier_conll03_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Community|
|Language:|en|
|Size:|1.3 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- RoBertaForTokenClassification
- NerConverter
- Finisher
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from motiondew)
author: John Snow Labs
name: distilbert_qa_motiondew_finetuned
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-finetuned` is an English model originally trained by `motiondew`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_motiondew_finetuned_en_4.3.0_3.0_1672774065577.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_motiondew_finetuned_en_4.3.0_3.0_1672774065577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_motiondew_finetuned","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_motiondew_finetuned","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_motiondew_finetuned|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/motiondew/distilbert-finetuned
---
layout: model
title: Legal Question Answering (Bert, Large)
author: John Snow Labs
name: legqa_bert_large
date: 2022-08-09
tags: [en, legal, qa, licensed]
task: Question Answering
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Legal BERT-based Question Answering model, trained on SQuAD 2.0 and fine-tuned on proprietary legal questions and answers.
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legqa_bert_large_en_1.0.0_3.2_1660053509660.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legqa_bert_large_en_1.0.0_3.2_1660053509660.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
spanClassifier = nlp.BertForQuestionAnswering.pretrained("legqa_bert_large","en", "legal/models") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = nlp.Pipeline().setStages([
documentAssembler,
spanClassifier
])
example = spark.createDataFrame([["Who was subjected to torture?", "The applicant submitted that her husband was subjected to treatment amounting to abuse whilst in the custody of police."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
result.select('answer.result').show()
```
## Results
```bash
`her husband`
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legqa_bert_large|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
Trained on SQuAD 2.0 and fine-tuned on proprietary legal questions and answers.
---
layout: model
title: RE Pipeline between Body Parts and Procedures
author: John Snow Labs
name: re_bodypart_proceduretest_pipeline
date: 2022-03-31
tags: [licensed, clinical, relation_extraction, body_part, procedures, en]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [re_bodypart_proceduretest](https://nlp.johnsnowlabs.com/2021/01/18/re_bodypart_proceduretest_en.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_BODYPART_ENT/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_bodypart_proceduretest_pipeline_en_3.4.1_3.0_1648733647318.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_bodypart_proceduretest_pipeline_en_3.4.1_3.0_1648733647318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("re_bodypart_proceduretest_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("re_bodypart_proceduretest_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.bodypart_proceduretest.pipeline").predict("""TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","yo") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Mo nifẹ Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","yo")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Mo nifẹ Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("yo.embed.w2v_cc_300d").predict("""Mo nifẹ Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|yo|
|Size:|85.4 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Clinical Deidentification Pipeline (Romanian)
author: John Snow Labs
name: clinical_deidentification
date: 2023-06-13
tags: [licensed, clinical, ro, deid, deidentification]
task: Pipeline Healthcare
language: ro
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline is trained with `w2v_cc_300d` Romanian embeddings and can be used to deidentify PHI information from medical texts in Romanian. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask, fake or obfuscate the following entities: `AGE`, `CITY`, `COUNTRY`, `DATE`, `DOCTOR`, `EMAIL`, `FAX`, `HOSPITAL`, `IDNUM`, `LOCATION-OTHER`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `PROFESSION`, `STREET`, `ZIP`, `ACCOUNT`, `LICENSE`, `PLATE`
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_ro_4.4.4_3.2_1686665695668.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_ro_4.4.4_3.2_1686665695668.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("clinical_deidentification", "ro", "clinical/models")
sample = """Medic : Dr. Agota EVELYN, C.N.P : 2450502264401, Data setului de analize: 25 May 2022
Varsta : 77, Nume si Prenume : BUREAN MARIA
Tel: +40(235)413773, E-mail : hale@gmail.com,
Licență : B004256985M, Înmatriculare : CD205113, Cont : FXHZ7170951927104999,
Spitalul Pentru Ochi de Deal Drumul Oprea Nr. 972 Vaslui, 737405 """
result = deid_pipeline.annotate(sample)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "ro", "clinical/models")
val sample = """Medic : Dr. Agota EVELYN, C.N.P : 2450502264401, Data setului de analize: 25 May 2022
Varsta : 77, Nume si Prenume : BUREAN MARIA
Tel: +40(235)413773, E-mail : hale@gmail.com,
Licență : B004256985M, Înmatriculare : CD205113, Cont : FXHZ7170951927104999,
Spitalul Pentru Ochi de Deal Drumul Oprea Nr. 972 Vaslui, 737405 """
val result = deid_pipeline.annotate(sample)
```
{:.nlu-block}
```python
import nlu
nlu.load("ro.deid.clinical").predict("""Medic : Dr. Agota EVELYN, C.N.P : 2450502264401, Data setului de analize: 25 May 2022
Varsta : 77, Nume si Prenume : BUREAN MARIA
Tel: +40(235)413773, E-mail : hale@gmail.com,
Licență : B004256985M, Înmatriculare : CD205113, Cont : FXHZ7170951927104999,
Spitalul Pentru Ochi de Deal Drumul Oprea Nr. 972 Vaslui, 737405 """)
```
## Results
```bash
Results
Masked with entity labels
------------------------------
Medic : Dr. <DOCTOR>, C.N.P : <IDNUM>, Data setului de analize: <DATE>
Varsta : <AGE>, Nume si Prenume : <PATIENT>
Tel: <PHONE>, E-mail : <EMAIL>,
Licență : <LICENSE>, Înmatriculare : <PLATE>, Cont : <ACCOUNT>,
<HOSPITAL> <STREET> <CITY>, <ZIP>
Masked with chars
------------------------------
Medic : Dr. [**********], C.N.P : [***********], Data setului de analize: [*********]
Varsta : **, Nume si Prenume : [**********]
Tel: [************], E-mail : [************],
Licență : [*********], Înmatriculare : [******], Cont : [******************],
[**************************] [******************] [****], [****]
Masked with fixed length chars
------------------------------
Medic : Dr. ****, C.N.P : ****, Data setului de analize: ****
Varsta : ****, Nume si Prenume : ****
Tel: ****, E-mail : ****,
Licență : ****, Înmatriculare : ****, Cont : ****,
**** **** ****, ****
Obfuscated
------------------------------
Medic : Dr. Doina Gheorghiu, C.N.P : 6794561192919, Data setului de analize: 01-04-2001
Varsta : 91, Nume si Prenume : Dragomir Emilia
Tel: 0248 551 376, E-mail : tudorsmaranda@kappa.ro,
Licență : T003485962M, Înmatriculare : AR-65-UPQ, Cont : KHHO5029180812813651,
Centrul Medical de Evaluare si Recuperare pentru Copii si Tineri Cristian Serban Buzias Aleea Voinea Curcani, 328479
```
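The masking policies shown above (entity labels, character masks, fixed-length masks) can be illustrated with a small pure-Python masker over detected `(start, end, label)` spans. This is a sketch of the masking idea only, not the pipeline's internal implementation; obfuscation, which substitutes realistic fake values, is omitted:

```python
def mask(text, spans, policy="entity_labels", fixed="****"):
    """Apply a simple de-identification policy to detected PHI spans.

    spans: non-overlapping, sorted list of (start, end, label) tuples.
    """
    out, prev = [], 0
    for start, end, label in spans:
        out.append(text[prev:start])
        if policy == "entity_labels":
            out.append(f"<{label}>")              # replace with the label
        elif policy == "chars":
            out.append("[" + "*" * (end - start - 2) + "]")  # same-length mask
        else:
            out.append(fixed)                     # fixed-length mask
        prev = end
    out.append(text[prev:])
    return "".join(out)

text = "Medic : Dr. Agota EVELYN"
spans = [(12, 24, "DOCTOR")]
print(mask(text, spans))             # Medic : Dr. <DOCTOR>
print(mask(text, spans, "chars"))    # Medic : Dr. [**********]
print(mask(text, spans, "fixed"))    # Medic : Dr. ****
```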
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|clinical_deidentification|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|ro|
|Size:|1.2 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ChunkMergeModel
- ChunkMergeModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- Finisher
---
layout: model
title: English asr_wav2vec2_large_960h_lv60 TFWav2Vec2ForCTC from facebook
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_960h_lv60
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h_lv60` is an English model originally trained by facebook.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_960h_lv60_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_960h_lv60_en_4.2.0_3.0_1664017406218.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_960h_lv60_en_4.2.0_3.0_1664017406218.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_960h_lv60', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_960h_lv60", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_960h_lv60|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|757.4 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_10_h_768
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-10_H-768` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_768_zh_4.2.4_3.0_1670021510353.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_768_zh_4.2.4_3.0_1670021510353.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_768","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_768","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_10_h_768|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|330.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-10_H-768
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://cloud.tencent.com/
---
layout: model
title: Legal Amendments Clause Binary Classifier
author: John Snow Labs
name: legclf_amendments_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `amendments` clause type. To use this model, make sure you provide enough context as an input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
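The multiline (paragraph) splitting mentioned above can be sketched in plain Python, independently of the Spark NLP splitters covered in the tutorial (the sample clause texts below are illustrative, not from the training data):

```python
import re

def split_paragraphs(text):
    """Split a long document into paragraphs on one or more blank lines."""
    normalized = text.replace("\r\n", "\n")
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", normalized)]
    return [p for p in paragraphs if p]  # drop empty chunks

clauses = split_paragraphs(
    "AMENDMENTS. This Agreement may be amended in writing.\n\n"
    "NOTICES. All notices shall be delivered by hand."
)
print(clauses)
```

Each resulting piece can then be classified separately, keeping every input under the 512-token limit.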
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you add.
## Predicted Entities
`other`, `amendments`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_amendments_clause_en_1.0.0_3.2_1660122105448.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_amendments_clause_en_1.0.0_3.2_1660122105448.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+-------------------------+
|result                   |
+-------------------------+
|[intercreditor-agreement]|
|[other]                  |
|[other]                  |
|[intercreditor-agreement]|
+-------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_intercreditor_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Legal documents scraped from the Internet and classified in-house, plus SEC documents.
## Benchmarking
```bash
label precision recall f1-score support
intercreditor-agreement 0.87 0.82 0.84 33
other 0.93 0.95 0.94 82
accuracy - - 0.91 115
macro-avg 0.90 0.88 0.89 115
weighted-avg 0.91 0.91 0.91 115
```
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from ModelTC)
author: John Snow Labs
name: roberta_qa_modeltc_base_squad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad` is an English model originally trained by `ModelTC`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_modeltc_base_squad_en_4.3.0_3.0_1674218615238.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_modeltc_base_squad_en_4.3.0_3.0_1674218615238.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_modeltc_base_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_modeltc_base_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_modeltc_base_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|464.2 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/ModelTC/roberta-base-squad
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_256_finetuned_squad_seed_4
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-256-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_4_en_4.3.0_3.0_1674214918693.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_4_en_4.3.0_3.0_1674214918693.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_4","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_4","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_256_finetuned_squad_seed_4|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|427.6 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-4
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_16_finetuned_squad_seed_42
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_42_en_4.3.0_3.0_1674214395082.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_42_en_4.3.0_3.0_1674214395082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_42","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_42","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_16_finetuned_squad_seed_42|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|425.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-42
---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_42
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-128-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_42_en_4.0.0_3.0_1655731309098.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_42_en_4.0.0_3.0_1655731309098.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_42","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_42","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_128d_seed_42").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_42|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|430.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-128-finetuned-squad-seed-42
---
layout: model
title: German asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458 TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458` is a German model originally trained by jonatasgrosman.
NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458_de_4.2.0_3.0_1664117864895.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458_de_4.2.0_3.0_1664117864895.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458", "de")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458", "de")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|de|
|Size:|1.2 GB|
---
layout: model
title: Portuguese Part of Speech Tagger (from Emanuel)
author: John Snow Labs
name: bert_pos_autonlp_pos_tag_bosque
date: 2022-05-09
tags: [bert, pos, part_of_speech, pt, open_source]
task: Part of Speech Tagging
language: pt
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `autonlp-pos-tag-bosque` is a Portuguese model originally trained by `Emanuel`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_autonlp_pos_tag_bosque_pt_3.4.2_3.0_1652091764630.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_autonlp_pos_tag_bosque_pt_3.4.2_3.0_1652091764630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_pos_autonlp_pos_tag_bosque","pt") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_autonlp_pos_tag_bosque","pt")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Eu amo Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_pos_autonlp_pos_tag_bosque|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|pt|
|Size:|406.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/Emanuel/autonlp-pos-tag-bosque
---
layout: model
title: Portuguese Lemmatizer
author: John Snow Labs
name: lemma
date: 2020-05-03 12:54:00 +0800
task: Lemmatization
language: pt
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [lemmatizer, pt]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous.
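As a plain-Python illustration of the core idea (a simplified sketch; the actual Spark NLP lemmatizer also uses context, and the toy dictionary entries below are assumptions for demonstration):

```python
# Toy Portuguese inflected-form -> lemma dictionary (illustrative entries only)
LEMMA_LOOKUP = {
    "é": "ser",
    "era": "ser",
    "foram": "ser",
    "médicos": "médico",
    "líderes": "líder",
}

def lemmatize(tokens):
    """Map each token to its root form; unknown tokens pass through unchanged."""
    return [LEMMA_LOOKUP.get(t.lower(), t) for t in tokens]

print(lemmatize(["era", "é", "John"]))  # ['ser', 'ser', 'John']
```

In the real model, ambiguity (one surface form mapping to several possible roots) is resolved using the surrounding context rather than a single fixed entry.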
{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_pt_2.5.0_2.4_1588499301752.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_pt_2.5.0_2.4_1588499301752.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
lemmatizer = LemmatizerModel.pretrained("lemma", "pt") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica.")
```
```scala
...
val lemmatizer = LemmatizerModel.pretrained("lemma", "pt")
.setInputCols(Array("token"))
.setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer))
val data = Seq("Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica."""]
lemma_df = nlu.load('pt.lemma').predict(text, output_level='token')
lemma_df.lemma.values[0]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=3, result='Além', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=5, end=6, result='de', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=8, end=10, result='ser', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=12, end=12, result='o', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=14, end=16, result='rei', metadata={'sentence': '0'}, embeddings=[]),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|lemma|
|Type:|lemmatizer|
|Compatibility:|Spark NLP 2.5.0+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[lemma]|
|Language:|pt|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: Context Spell Checker Pipeline for English
author: John Snow Labs
name: spellcheck_dl_pipeline
date: 2022-04-18
tags: [spellcheck, spell, spellcheck_pipeline, spelling_corrector, en, open_source]
task: Spell Check
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 2.4
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained spellchecker pipeline is built on top of the [spellcheck_dl](https://nlp.johnsnowlabs.com/2022/04/02/spellcheck_dl_en_2_4.html) model. It is intended for PySpark 2.4.x users with Spark NLP 3.4.2 and above.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spellcheck_dl_pipeline_en_3.4.2_2.4_1650285592232.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/spellcheck_dl_pipeline_en_3.4.2_2.4_1650285592232.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("spellcheck_dl_pipeline", lang = "en")
text = ["During the summer we have the best ueather.", "I have a black ueather jacket, so nice."]
pipeline.annotate(text)
```
```scala
val pipeline = new PretrainedPipeline("spellcheck_dl_pipeline", lang = "en")
val example = Array("During the summer we have the best ueather.", "I have a black ueather jacket, so nice.")
pipeline.annotate(example)
```
## Results
```bash
[{'checked': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'weather', '.'],
'document': ['During the summer we have the best ueather.'],
'token': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'ueather', '.']},
{'checked': ['I', 'have', 'a', 'black', 'leather', 'jacket', ',', 'so', 'nice', '.'],
'document': ['I have a black ueather jacket, so nice.'],
'token': ['I', 'have', 'a', 'black', 'ueather', 'jacket', ',', 'so', 'nice', '.']}]
```
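The correction of `ueather` to `weather` in one sentence and `leather` in the other depends on sentence context. A context-free sketch of the underlying candidate generation (edit distance 1 against a tiny assumed vocabulary; an illustration, not the ContextSpellCheckerModel algorithm) looks like:

```python
import string

VOCAB = {"weather", "leather", "summer", "jacket"}

def edits1(word):
    """All strings at edit distance 1 from word (deletes, substitutions, inserts)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    subs = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + subs + inserts)

def candidates(word):
    """Vocabulary words reachable with one edit; context would pick among them."""
    return sorted(VOCAB & edits1(word))

print(candidates("ueather"))  # ['leather', 'weather']
```

The pretrained pipeline goes further: a language model scores each candidate in its sentence, which is how `ueather` resolves differently in the two examples above.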
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|spellcheck_dl_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|99.4 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- ContextSpellCheckerModel
---
layout: model
title: Word2Vec Embeddings in Chuvash (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-14
tags: [cc, embeddings, fastText, word2vec, cv, open_source]
task: Embeddings
language: cv
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Word Embeddings lookup annotator that maps tokens to vectors.
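Conceptually, the lookup works like a fixed dictionary from tokens to vectors (a plain-Python sketch with a toy 4-dimensional table; the real model uses 300 dimensions over a Chuvash vocabulary, and the entries below are assumptions for illustration):

```python
# Toy token -> vector table (real model: 300 dimensions, Chuvash vocabulary)
DIM = 4
EMBEDDINGS = {
    "шыв": [0.4, -0.1, 0.0, 0.2],
    "кун": [0.1, 0.3, -0.2, 0.5],
}

def embed(tokens):
    """Return one fixed vector per token; unknown tokens get the zero vector."""
    zero = [0.0] * DIM
    return [EMBEDDINGS.get(t, zero) for t in tokens]

print(embed(["шыв", "unknown"])[1])  # [0.0, 0.0, 0.0, 0.0]
```

Because the table is static, the same token always maps to the same vector regardless of context, which is what distinguishes this lookup annotator from contextual embeddings such as BERT.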
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_cv_3.4.1_3.0_1647291066615.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_cv_3.4.1_3.0_1647291066615.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","cv") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","cv")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("cv.embed.w2v_cc_300d").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|cv|
|Size:|251.9 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Translate Finnish to English Pipeline
author: John Snow Labs
name: translate_fi_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, fi, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with contributions from many academic groups (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial partners.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `fi`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_fi_en_xx_2.7.0_2.4_1609698992009.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_fi_en_xx_2.7.0_2.4_1609698992009.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_fi_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_fi_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.fi.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_fi_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Russian T5ForConditionalGeneration Small Cased model (from cointegrated)
author: John Snow Labs
name: t5_rut5_small_chitchat
date: 2023-01-30
tags: [ru, open_source, t5, tensorflow]
task: Text Generation
language: ru
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rut5-small-chitchat` is a Russian model originally trained by `cointegrated`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_rut5_small_chitchat_ru_4.3.0_3.0_1675106805725.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_rut5_small_chitchat_ru_4.3.0_3.0_1675106805725.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_rut5_small_chitchat","ru") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_rut5_small_chitchat","ru")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_rut5_small_chitchat|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|ru|
|Size:|277.4 MB|
## References
- https://huggingface.co/cointegrated/rut5-small-chitchat
---
layout: model
title: Legal Further Assurances Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_further_assurances_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, further_assurances, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset is aimed at contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Further_Assurances` clause type. To use this model, make sure you provide enough context as an input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Further_Assurances`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_further_assurances_bert_en_1.0.0_3.0_1678050719026.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_further_assurances_bert_en_1.0.0_3.0_1678050719026.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
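This card is missing its usage snippet. The following is a minimal sketch modeled on the classifier pipelines elsewhere in these docs: the classifier consumes sentence embeddings (see Input Labels below), so an embeddings stage must precede it. The `sent_bert_base_cased` embeddings name and the `legal/models` bucket are assumptions; check the Models Hub entry for the exact companion embeddings, and note that with the `johnsnowlabs` library the classifier class may be referenced as `legal.ClassifierDLModel`.

```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

# Sentence-level embeddings feeding the clause classifier;
# the embeddings model name here is an assumption.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLModel.pretrained("legclf_further_assurances_bert", "en", "legal/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

data = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("class.result").show(truncate=False)
```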
## Results
```bash
+--------------------+
|result              |
+--------------------+
|[Further_Assurances]|
|[Other]             |
|[Other]             |
|[Further_Assurances]|
+--------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_further_assurances_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Further_Assurances 0.93 0.97 0.95 120
Other 0.98 0.94 0.96 147
accuracy - - 0.96 267
macro-avg 0.95 0.96 0.95 267
weighted-avg 0.96 0.96 0.96 267
```
---
layout: model
title: Classifier for Adverse Drug Events in Small Conversations
author: John Snow Labs
name: classifierdl_ade_conversational_biobert
date: 2021-01-21
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 2.7.1
spark_version: 2.4
tags: [en, licensed, classifier, clinical]
supported: true
annotator: ClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Classifies sentences into two categories:
`True` : The sentence is talking about a possible ADE.
`False` : The sentence doesn't contain any information about an ADE.
## Predicted Entities
`True`, `False`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierdl_ade_conversational_biobert_en_2.7.1_2.4_1611246389884.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierdl_ade_conversational_biobert_en_2.7.1_2.4_1611246389884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(['document']).setOutputCol('token')
embeddings = BertEmbeddings.pretrained('biobert_pubmed_base_cased')\
.setInputCols(["document", 'token'])\
.setOutputCol("word_embeddings")
sentence_embeddings = SentenceEmbeddings() \
.setInputCols(["document", "word_embeddings"]) \
.setOutputCol("sentence_embeddings") \
.setPoolingStrategy("AVERAGE")
classifier = ClassifierDLModel.pretrained('classifierdl_ade_conversational_biobert', 'en', 'clinical/models')\
.setInputCols(['document', 'token', 'sentence_embeddings']).setOutputCol('class')
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, sentence_embeddings, classifier])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate(["I feel a bit drowsy & have a little blurred vision after taking an insulin", "I feel great after taking tylenol"])
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.ade.conversational").predict("""I feel a bit drowsy & have a little blurred vision after taking an insulin""")
```
## Results
```bash
| | text | label |
|--:|:---------------------------------------------------------------------------|:------|
| 0 | I feel a bit drowsy & have a little blurred vision after taking an insulin | True |
| 1 | I feel great after taking tylenol | False |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|classifierdl_ade_conversational_biobert|
|Compatibility:|Spark NLP 2.7.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Dependencies:|biobert_pubmed_base_cased|
## Data Source
Trained on a custom dataset comprising CADEC, DRUG-AE, and Twimed.
## Benchmarking
```bash
precision recall f1-score support
False 0.91 0.94 0.93 5706
True 0.80 0.70 0.74 1800
micro avg 0.89 0.89 0.89 7506
macro avg 0.85 0.82 0.84 7506
weighted avg 0.88 0.89 0.88 7506
```
---
layout: model
title: Smaller BERT Sentence Embeddings (L-10_H-768_A-12)
author: John Snow Labs
name: sent_small_bert_L10_768
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertSentenceEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L10_768_en_2.6.0_2.4_1598351479319.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L10_768_en_2.6.0_2.4_1598351479319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L10_768", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L10_768", "en")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.small_bert_L10_768').predict(text, output_level='sentence')
embeddings_df
```
{:.h2_title}
## Results
```bash
en_embed_sentence_small_bert_L10_768_embeddings sentence
[-0.6537564396858215, -0.2422734946012497, -0.... I hate cancer
[0.06436929106712341, -0.34515661001205444, 0.... Antibiotics aren't painkiller
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sent_small_bert_L10_768|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|[en]|
|Dimension:|768|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1
---
layout: model
title: Pipeline to Extract Negation and Uncertainty Entities from Spanish Medical Texts
author: John Snow Labs
name: ner_negation_uncertainty_pipeline
date: 2023-03-09
tags: [es, clinical, licensed, ner, unc, usco, neg, nsco, negation, uncertainty]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_negation_uncertainty](https://nlp.johnsnowlabs.com/2022/08/13/ner_negation_uncertainty_es_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_negation_uncertainty_pipeline_es_4.3.0_3.2_1678359171669.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_negation_uncertainty_pipeline_es_4.3.0_3.2_1678359171669.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_negation_uncertainty_pipeline", "es", "clinical/models")
text = '''e realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa).'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_negation_uncertainty_pipeline", "es", "clinical/models")
val text = "e realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa)."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
+------------------------------------------------------+---------+
|chunk |ner_label|
+------------------------------------------------------+---------+
|probable de |UNC |
|cirrosis hepática |USCO |
|no |NEG |
|conocida previamente |NSCO |
|no |NEG |
|se realizó paracentesis control por escasez de liquido|NSCO |
|susceptible de |UNC |
|ca basocelular perlado |USCO |
+------------------------------------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_negation_uncertainty_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|es|
|Size:|318.6 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- RoBertaEmbeddings
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Sentence Detection in Bosnian Text
author: John Snow Labs
name: sentence_detector_dl
date: 2021-08-30
tags: [bs, sentence_detection, open_source]
task: Sentence Detection
language: bs
edition: Spark NLP 3.2.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_bs_3.2.0_3.0_1630317779410.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_bs_3.2.0_3.0_1630317779410.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel\
.pretrained("sentence_detector_dl", "bs") \
.setInputCols(["document"]) \
.setOutputCol("sentences")
sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL]))
sd_model.fullAnnotate("""Tražite sjajan izvor čitanja odlomaka na engleskom? Došli ste na pravo mjesto. Prema nedavnom istraživanju, navika čitanja u današnjoj mladosti brzo se smanjuje. Ne mogu se usredotočiti na dati odlomak za čitanje engleskog jezika duže od nekoliko sekundi! Takođe, čitanje je bilo i jeste sastavni dio svih takmičarskih ispita. Dakle, kako poboljšati svoje vještine čitanja? Odgovor na ovo pitanje zapravo je drugo pitanje: Kakva je korist od vještine čitanja? Glavna svrha čitanja je 'imati smisla'.""")
```
```scala
val documenter = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "bs")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val pipeline = new Pipeline().setStages(Array(documenter, model))
val data = Seq("Tražite sjajan izvor čitanja odlomaka na engleskom? Došli ste na pravo mjesto. Prema nedavnom istraživanju, navika čitanja u današnjoj mladosti brzo se smanjuje. Ne mogu se usredotočiti na dati odlomak za čitanje engleskog jezika duže od nekoliko sekundi! Takođe, čitanje je bilo i jeste sastavni dio svih takmičarskih ispita. Dakle, kako poboljšati svoje vještine čitanja? Odgovor na ovo pitanje zapravo je drugo pitanje: Kakva je korist od vještine čitanja? Glavna svrha čitanja je 'imati smisla'.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
nlu.load('bs.sentence_detector').predict("Tražite sjajan izvor čitanja odlomaka na engleskom? Došli ste na pravo mjesto. Prema nedavnom istraživanju, navika čitanja u današnjoj mladosti brzo se smanjuje. Ne mogu se usredotočiti na dati odlomak za čitanje engleskog jezika duže od nekoliko sekundi! Takođe, čitanje je bilo i jeste sastavni dio svih takmičarskih ispita. Dakle, kako poboljšati svoje vještine čitanja? Odgovor na ovo pitanje zapravo je drugo pitanje: Kakva je korist od vještine čitanja? Glavna svrha čitanja je 'imati smisla'.", output_level ='sentence')
```
## Results
```bash
+-----------------------------------------------------------------------------------------------+
|result |
+-----------------------------------------------------------------------------------------------+
|[Tražite sjajan izvor čitanja odlomaka na engleskom?] |
|[Došli ste na pravo mjesto.] |
|[Prema nedavnom istraživanju, navika čitanja u današnjoj mladosti brzo se smanjuje.] |
|[Ne mogu se usredotočiti na dati odlomak za čitanje engleskog jezika duže od nekoliko sekundi!]|
|[Takođe, čitanje je bilo i jeste sastavni dio svih takmičarskih ispita.] |
|[Dakle, kako poboljšati svoje vještine čitanja?] |
|[Odgovor na ovo pitanje zapravo je drugo pitanje:] |
|[Kakva je korist od vještine čitanja?] |
|[Glavna svrha čitanja je 'imati smisla'.] |
+-----------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sentence_detector_dl|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[sentences]|
|Language:|bs|
## Benchmarking
```bash
label Accuracy Recall Prec F1
0 0.98 1.00 0.96 0.98
```
---
layout: model
title: Italian BertForMaskedLM Base Cased model (from dbmdz)
author: John Snow Labs
name: bert_embeddings_base_italian_xxl_cased
date: 2022-12-02
tags: [it, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: it
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-italian-xxl-cased` is an Italian model originally trained by `dbmdz`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_italian_xxl_cased_it_4.2.4_3.0_1670017995735.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_italian_xxl_cased_it_4.2.4_3.0_1670017995735.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_italian_xxl_cased","it") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_italian_xxl_cased","it")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_italian_xxl_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|it|
|Size:|415.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/dbmdz/bert-base-italian-xxl-cased
- http://opus.nlpl.eu/
- https://traces1.inria.fr/oscar/
- https://github.com/dbmdz/berts/issues/7
- https://github.com/stefan-it/turkish-bert/tree/master/electra
- https://github.com/stefan-it/italian-bertelectra
- https://github.com/dbmdz/berts/issues/new
---
layout: model
title: Telugu BertForMaskedLM Cased model (from neuralspace-reverie)
author: John Snow Labs
name: bert_embeddings_indic_transformers
date: 2022-12-06
tags: [te, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: te
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-te-bert` is a Telugu model originally trained by `neuralspace-reverie`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_te_4.2.4_3.0_1670326679548.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_te_4.2.4_3.0_1670326679548.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","te") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","te")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_indic_transformers|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|te|
|Size:|611.9 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/neuralspace-reverie/indic-transformers-te-bert
- https://oscar-corpus.com/
---
layout: model
title: English T5ForConditionalGeneration Cased model (from benjamyu)
author: John Snow Labs
name: t5_autotrain_ms_2_1174443640
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-ms-2-1174443640` is an English model originally trained by `benjamyu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_autotrain_ms_2_1174443640_en_4.3.0_3.0_1675099983295.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_autotrain_ms_2_1174443640_en_4.3.0_3.0_1675099983295.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_autotrain_ms_2_1174443640","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_autotrain_ms_2_1174443640","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_autotrain_ms_2_1174443640|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|916.0 MB|
## References
- https://huggingface.co/benjamyu/autotrain-ms-2-1174443640
---
layout: model
title: English BertForTokenClassification Cased model (from Lucifermorningstar011)
author: John Snow Labs
name: bert_token_classifier_autotrain_final_784824206
date: 2022-11-30
tags: [en, open_source, bert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-final-784824206` is an English model originally trained by `Lucifermorningstar011`.
## Predicted Entities
`9`, `0`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_final_784824206_en_4.2.4_3.0_1669814522383.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_final_784824206_en_4.2.4_3.0_1669814522383.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_final_784824206","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_final_784824206","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_autotrain_final_784824206|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/Lucifermorningstar011/autotrain-final-784824206
---
layout: model
title: English BertForQuestionAnswering model (from jackh1995)
author: John Snow Labs
name: bert_qa_bert_finetuned_jackh1995
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned` is an English model originally trained by `jackh1995`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_jackh1995_en_4.0.0_3.0_1654534832705.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_jackh1995_en_4.0.0_3.0_1654534832705.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_finetuned_jackh1995","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_finetuned_jackh1995","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bert.by_jackh1995").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_finetuned_jackh1995|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|381.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/jackh1995/bert-finetuned
---
layout: model
title: Named Entity Recognition (NER) Model in Swedish (GloVe 840B 300)
author: John Snow Labs
name: swedish_ner_840B_300
date: 2020-08-30
task: Named Entity Recognition
language: sv
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [ner, sv, open_source]
supported: true
annotator: NerDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Swedish NER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. The model is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline.
{:.h2_title}
## Predicted Entities
Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Product-`PRO`, Date-`DATE`, Event-`EVENT`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_SV/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/swedish_ner_840B_300_sv_2.6.0_2.4_1598810268072.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/swedish_ner_840B_300_sv_2.6.0_2.4_1598810268072.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang = "xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
ner_model = NerDLModel.pretrained("swedish_ner_840B_300", "sv") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text'))
result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (född 28 oktober 1955) är en amerikansk affärsmagnat, mjukvaruutvecklare, investerare och filantrop. Han är mest känd som medgrundare av Microsoft Corporation. Under sin karriär på Microsoft innehade Gates befattningar som styrelseordförande, verkställande direktör (VD), VD och programvaruarkitekt samtidigt som han var den största enskilda aktieägaren fram till maj 2014. Han är en av de mest kända företagarna och pionjärerna inom mikrodatorrevolutionen på 1970- och 1980-talet. Född och uppvuxen i Seattle, Washington, grundade Gates Microsoft tillsammans med barndomsvän Paul Allen 1975 i Albuquerque, New Mexico; det blev vidare världens största datorprogramföretag. Gates ledde företaget som styrelseordförande och VD tills han avgick som VD i januari 2000, men han förblev ordförande och blev chef för programvaruarkitekt. Under slutet av 1990-talet hade Gates kritiserats för sin affärstaktik, som har ansetts konkurrensbegränsande. Detta yttrande har upprätthållits genom många domstolsbeslut. I juni 2006 meddelade Gates att han skulle gå över till en deltidsroll på Microsoft och heltid på Bill & Melinda Gates Foundation, den privata välgörenhetsstiftelsen som han och hans fru, Melinda Gates, grundade 2000. Han överförde gradvis sina uppgifter till Ray Ozzie och Craig Mundie. Han avgick som styrelseordförande i Microsoft i februari 2014 och tillträdde en ny tjänst som teknologrådgivare för att stödja den nyutnämnda VD Satya Nadella.']], ["text"]))
```
```scala
...
val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang = "xx")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("swedish_ner_840B_300", "sv")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("William Henry Gates III (född 28 oktober 1955) är en amerikansk affärsmagnat, mjukvaruutvecklare, investerare och filantrop. Han är mest känd som medgrundare av Microsoft Corporation. Under sin karriär på Microsoft innehade Gates befattningar som styrelseordförande, verkställande direktör (VD), VD och programvaruarkitekt samtidigt som han var den största enskilda aktieägaren fram till maj 2014. Han är en av de mest kända företagarna och pionjärerna inom mikrodatorrevolutionen på 1970- och 1980-talet. Född och uppvuxen i Seattle, Washington, grundade Gates Microsoft tillsammans med barndomsvän Paul Allen 1975 i Albuquerque, New Mexico; det blev vidare världens största datorprogramföretag. Gates ledde företaget som styrelseordförande och VD tills han avgick som VD i januari 2000, men han förblev ordförande och blev chef för programvaruarkitekt. Under slutet av 1990-talet hade Gates kritiserats för sin affärstaktik, som har ansetts konkurrensbegränsande. Detta yttrande har upprätthållits genom många domstolsbeslut. I juni 2006 meddelade Gates att han skulle gå över till en deltidsroll på Microsoft och heltid på Bill & Melinda Gates Foundation, den privata välgörenhetsstiftelsen som han och hans fru, Melinda Gates, grundade 2000. Han överförde gradvis sina uppgifter till Ray Ozzie och Craig Mundie. Han avgick som styrelseordförande i Microsoft i februari 2014 och tillträdde en ny tjänst som teknologrådgivare för att stödja den nyutnämnda VD Satya Nadella.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""William Henry Gates III (född 28 oktober 1955) är en amerikansk affärsmagnat, mjukvaruutvecklare, investerare och filantrop. Han är mest känd som medgrundare av Microsoft Corporation. Under sin karriär på Microsoft innehade Gates befattningar som styrelseordförande, verkställande direktör (VD), VD och programvaruarkitekt samtidigt som han var den största enskilda aktieägaren fram till maj 2014. Han är en av de mest kända företagarna och pionjärerna inom mikrodatorrevolutionen på 1970- och 1980-talet. Född och uppvuxen i Seattle, Washington, grundade Gates Microsoft tillsammans med barndomsvän Paul Allen 1975 i Albuquerque, New Mexico; det blev vidare världens största datorprogramföretag. Gates ledde företaget som styrelseordförande och VD tills han avgick som VD i januari 2000, men han förblev ordförande och blev chef för programvaruarkitekt. Under slutet av 1990-talet hade Gates kritiserats för sin affärstaktik, som har ansetts konkurrensbegränsande. Detta yttrande har upprätthållits genom många domstolsbeslut. I juni 2006 meddelade Gates att han skulle gå över till en deltidsroll på Microsoft och heltid på Bill & Melinda Gates Foundation, den privata välgörenhetsstiftelsen som han och hans fru, Melinda Gates, grundade 2000. Han överförde gradvis sina uppgifter till Ray Ozzie och Craig Mundie. Han avgick som styrelseordförande i Microsoft i februari 2014 och tillträdde en ny tjänst som teknologrådgivare för att stödja den nyutnämnda VD Satya Nadella."""]
ner_df = nlu.load('sv.ner.840B_300').predict(text, output_level = "chunk")
ner_df[["entities", "entities_confidence"]]
```
{:.h2_title}
## Results
```bash
+------------------------+---------+
|chunk |ner_label|
+------------------------+---------+
|William Henry Gates |PER |
|Microsoft Corporation |ORG |
|Microsoft |ORG |
|Gates |MISC |
|Seattle |LOC |
|Washington |LOC |
|Gates Microsoft |ORG |
|Paul Allen |PER |
|Albuquerque |LOC |
|New Mexico |MISC |
|Gates |MISC |
|Gates |MISC |
|Gates |MISC |
|Microsoft |ORG |
|Bill |MISC |
|Melinda Gates Foundation|MISC |
|Melinda Gates |MISC |
|Ray Ozzie |PER |
|Craig Mundie |PER |
|Microsoft |ORG |
+------------------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|swedish_ner_840B_300|
|Type:|ner|
|Compatibility:| Spark NLP 2.6.0+|
|Edition:|Official|
|License:|Open Source|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|sv|
|Case sensitive:|false|
{:.h2_title}
## Data Source
Trained on a custom dataset with multi-lingual GloVe Embeddings ``glove_840B_300``.
---
layout: model
title: Part of Speech for Finnish
author: John Snow Labs
name: pos_ud_tdt
date: 2020-05-04 23:32:00 +0800
task: Part of Speech Tagging
language: fi
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [pos, fi]
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_tdt_fi_2.5.0_2.4_1588622348985.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_tdt_fi_2.5.0_2.4_1588622348985.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
pos = PerceptronModel.pretrained("pos_ud_tdt", "fi") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä.")
```
```scala
...
val pos = PerceptronModel.pretrained("pos_ud_tdt", "fi")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä."""]
pos_df = nlu.load('fi.pos.ud_tdt').predict(text, output_level='token')
pos_df
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='pos', begin=0, end=2, result='PRON', metadata={'word': 'Sen'}),
Row(annotatorType='pos', begin=4, end=10, result='ADP', metadata={'word': 'lisäksi'}),
Row(annotatorType='pos', begin=11, end=11, result='PUNCT', metadata={'word': ','}),
Row(annotatorType='pos', begin=13, end=16, result='SCONJ', metadata={'word': 'että'}),
Row(annotatorType='pos', begin=18, end=20, result='PRON', metadata={'word': 'hän'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_tdt|
|Type:|pos|
|Compatibility:|Spark NLP 2.5.0+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[pos]|
|Language:|fi|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)
---
layout: model
title: PICO Classifier (BERT)
author: John Snow Labs
name: bert_sequence_classifier_pico_biobert
date: 2022-02-07
tags: [bert, sequence_classification, en, licensed]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: MedicalBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Classify medical text according to the PICO framework.
This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier.
## Predicted Entities
`CONCLUSIONS`, `DESIGN_SETTING`, `INTERVENTION`, `PARTICIPANTS`, `FINDINGS`, `MEASUREMENTS`, `AIMS`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_pico_biobert_en_3.4.1_3.0_1644265236813.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_pico_biobert_en_3.4.1_3.0_1644265236813.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_pico_biobert", "en", "clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
data = spark.createDataFrame([["To compare the results of recording enamel opacities using the TF and modified DDE indices."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_pico_biobert", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier))
val data = Seq("""To compare the results of recording enamel opacities using the TF and modified DDE indices.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.pico.seq_biobert").predict("""To compare the results of recording enamel opacities using the TF and modified DDE indices.""")
```
## Results
```bash
+-------------------------------------------------------------------------------------------+------+
|text |result|
+-------------------------------------------------------------------------------------------+------+
|To compare the results of recording enamel opacities using the TF and modified DDE indices.|[AIMS]|
+-------------------------------------------------------------------------------------------+------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_pico_biobert|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.0 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
This model is trained on a custom dataset derived from a PICO classification dataset.
## Benchmarking
```bash
label precision recall f1-score support
AIMS 0.92 0.94 0.93 3813
CONCLUSIONS 0.85 0.86 0.86 4314
DESIGN_SETTING 0.88 0.78 0.83 5628
FINDINGS 0.91 0.92 0.91 9242
INTERVENTION 0.71 0.78 0.74 2331
MEASUREMENTS 0.79 0.87 0.83 3219
PARTICIPANTS 0.86 0.81 0.83 2723
accuracy - - 0.86 31270
macro-avg 0.85 0.85 0.85 31270
weighted-avg 0.87 0.86 0.86 31270
```
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from benny6)
author: John Snow Labs
name: roberta_qa_tydi
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-tydiqa` is an English model originally trained by `benny6`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_tydi_en_4.3.0_3.0_1674222584111.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_tydi_en_4.3.0_3.0_1674222584111.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tydi","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tydi","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_tydi|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|471.7 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/benny6/roberta-tydiqa
---
layout: model
title: Legal Cooperation Clause Binary Classifier
author: John Snow Labs
name: legclf_cooperation_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `cooperation` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a True/False value for each of the clause models you add.
## Predicted Entities
`other`, `cooperation`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cooperation_clause_en_1.0.0_3.2_1660122299955.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cooperation_clause_en_1.0.0_3.2_1660122299955.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
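As with the other classifiers on this page, the model consumes sentence embeddings and emits a category. Below is a minimal Python sketch of that pipeline; the embeddings model name (`sent_bert_base_cased`) and the input column name are assumptions for illustration, so check this model's Models Hub entry for the exact embeddings it was trained with.

```python
# Hypothetical pipeline sketch: the embeddings model name below is an
# assumption, not confirmed by this card.
documentAssembler = DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_cooperation_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("clause_text")
result = pipeline.fit(df).transform(df)
```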
## Results
```bash
+-------------+
|       result|
+-------------+
|[cooperation]|
|      [other]|
|      [other]|
|[cooperation]|
+-------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_cooperation_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|
## References
Legal documents, scraped from the Internet and classified in-house.
## Benchmarking
```bash
label precision recall f1-score support
cooperation 0.91 0.94 0.93 34
other 0.98 0.97 0.97 96
accuracy - - 0.96 130
macro-avg 0.95 0.95 0.95 130
weighted-avg 0.96 0.96 0.96 130
```
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_by_ali221000262 TFWav2Vec2ForCTC from ali221000262
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_ali221000262
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_ali221000262` is an English model originally trained by ali221000262.
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_ali221000262_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_ali221000262_en_4.2.0_3.0_1664036650196.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_ali221000262_en_4.2.0_3.0_1664036650196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_ali221000262', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_ali221000262", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_ali221000262|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|354.9 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Sentence Entity Resolver for billable ICD10-CM HCC codes
author: John Snow Labs
name: sbiobertresolve_icd10cm_augmented_billable_hcc
date: 2021-02-06
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 2.7.3
spark_version: 2.4
tags: [licensed, clinical, en, entity_resolution]
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to ICD10-CM codes using chunk embeddings (augmented with synonyms, four times richer than the previous resolver). It also adds support for 7-digit codes with HCC status.
## Predicted Entities
Outputs 7-digit billable ICD codes. In the result, look for the `aux_label` parameter in the metadata to get the HCC status. The HCC status can be split to get further information: billable status, HCC status, and HCC score.
For example, in the example shared below, the billable status is `1`, the HCC status is `1`, and the HCC score is `8`.
{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_billable_hcc_en_2.7.3_2.4_1612609178670.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_billable_hcc_en_2.7.3_2.4_1612609178670.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
The ```sbiobertresolve_icd10cm_augmented_billable_hcc``` resolver model must be used with ```sbiobert_base_cased_mli``` as embeddings and ```ner_clinical``` as the NER model, with ```PROBLEM``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sbert_embeddings")
icd10_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models") \
.setInputCols(["document", "sbert_embeddings"]) \
.setOutputCol("icd10cm_code")\
.setDistanceFunction("EUCLIDEAN")\
.setReturnCosineDistances(True)
bert_pipeline_icd = Pipeline(stages = [document_assembler, sbert_embedder, icd10_resolver])
data = spark.createDataFrame([["metastatic lung cancer"]]).toDF("text")
results = bert_pipeline_icd.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sbert_embeddings")
val icd10_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models")
.setInputCols(Array("document", "sbert_embeddings"))
.setOutputCol("icd10cm_code")
.setDistanceFunction("EUCLIDEAN")
.setReturnCosineDistances(true)
val bert_pipeline_icd = new Pipeline().setStages(Array(document_assembler, sbert_embedder, icd10_resolver))
val data = Seq("metastatic lung cancer").toDF("text")
val result = bert_pipeline_icd.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.icd10cm.augmented_billable").predict("""metastatic lung cancer""")
```
## Results
```bash
| | chunks | code | resolutions | all_codes | billable_hcc_status_score | all_distances |
|---:|:-----------------------|:-------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------|:----------------------------|:-------------------------------------------------------------------------------------------------------------------------|
| 0 | metastatic lung cancer | C7800 | ['cancer metastatic to lung', 'metastasis from malignant tumor of lung', 'cancer metastatic to left lung', 'history of cancer metastatic to lung', 'metastatic cancer', 'history of cancer metastatic to lung (situation)', 'metastatic adenocarcinoma to bilateral lungs', 'cancer metastatic to chest wall', 'metastatic malignant neoplasm to left lower lobe of lung', 'metastatic carcinoid tumour', 'cancer metastatic to respiratory tract', 'metastatic carcinoid tumor'] | ['C7800', 'C349', 'C7801', 'Z858', 'C800', 'Z8511', 'C780', 'C798', 'C7802', 'C799', 'C7830', 'C7B00'] | ['1', '1', '8'] | ['0.0464', '0.0829', '0.0852', '0.0860', '0.0914', '0.0989', '0.1133', '0.1220', '0.1220', '0.1253', '0.1249', '0.1260'] |
```
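The `billable_hcc_status_score` column above holds the three components of the HCC status described earlier. A minimal sketch of unpacking it, assuming the `aux_label` metadata is a `||`-separated string such as `"1||1||8"` (the separator and the helper name are illustrative assumptions, not part of the resolver's API):

```python
def parse_hcc_status(aux_label: str) -> dict:
    """Split an aux_label string like '1||1||8' into its parts.

    Assumes three '||'-separated fields: billable status,
    HCC status, and HCC score (this format is an assumption).
    """
    billable, hcc_status, hcc_score = aux_label.split("||")
    return {
        "billable": billable,
        "hcc_status": hcc_status,
        "hcc_score": hcc_score,
    }

print(parse_hcc_status("1||1||8"))
```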
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_icd10cm_augmented_billable_hcc|
|Compatibility:|Healthcare NLP 2.7.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[icd10cm_code]|
|Language:|en|
---
layout: model
title: Detect PHI for Generic Deidentification in Romanian (BERT)
author: John Snow Labs
name: ner_deid_generic_bert
date: 2022-08-15
tags: [licensed, clinical, ro, deidentification, phi, generic, bert]
task: Named Entity Recognition
language: ro
edition: Healthcare NLP 4.0.2
spark_version: 3.0
supported: true
recommended: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model uses a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Deidentification NER (Romanian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It is trained with bert_base_cased embeddings and can detect 7 generic entities.
This NER model is trained with a combination of custom datasets with several data augmentation mechanisms.
## Predicted Entities
`AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_ro_4.0.2_3.0_1660551174367.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_ro_4.0.2_3.0_1660551174367.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\
.setInputCols(["sentence","token"])\
.setOutputCol("word_embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models")\
.setInputCols(["sentence","token","word_embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner,
ner_converter])
text = """
Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""
data = spark.createDataFrame([[text]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")
.setInputCols(Array("sentence","token"))
.setOutputCol("word_embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models")
.setInputCols(Array("sentence","token","word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter))
val text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""
val data = Seq(text).toDS.toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ro.med_ner.deid_generic_bert").predict("""
Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401""")
```
## Results
```bash
+----------------------------+---------+
|chunk |ner_label|
+----------------------------+---------+
|Spitalul Pentru Ochi de Deal|LOCATION |
|Drumul Oprea Nr |LOCATION |
|972 |LOCATION |
|Vaslui |LOCATION |
|737405 |LOCATION |
|+40(235)413773 |CONTACT |
|25 May 2022 |DATE |
|BUREAN MARIA |NAME |
|77 |AGE |
|Agota Evelyn Tımar |NAME |
|2450502264401 |ID |
+----------------------------+---------+
```
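The detected chunks can then be masked to de-identify the document. A minimal post-processing sketch, using the chunks and labels copied from the results table above (real pipelines would use the `DeIdentification` annotator instead of plain string replacement):

```python
# Chunks and labels detected by ner_deid_generic_bert, copied from the table above.
detected = [
    ("Spitalul Pentru Ochi de Deal", "LOCATION"),
    ("Drumul Oprea Nr", "LOCATION"),
    ("972", "LOCATION"),
    ("Vaslui", "LOCATION"),
    ("737405", "LOCATION"),
    ("+40(235)413773", "CONTACT"),
    ("25 May 2022", "DATE"),
    ("BUREAN MARIA", "NAME"),
    ("77", "AGE"),
    ("Agota Evelyn Tımar", "NAME"),
    ("2450502264401", "ID"),
]

text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""

# Replace longer chunks first so short substrings (e.g. "77") don't clobber them.
masked = text
for chunk, label in sorted(detected, key=lambda c: -len(c[0])):
    masked = masked.replace(chunk, f"<{label}>")
print(masked)
```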
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_generic_bert|
|Compatibility:|Healthcare NLP 4.0.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ro|
|Size:|16.3 MB|
## References
- Custom John Snow Labs datasets
- Data augmentation techniques
## Benchmarking
```bash
label precision recall f1-score support
AGE 0.95 0.97 0.96 1186
CONTACT 0.99 0.98 0.98 366
DATE 0.96 0.92 0.94 4518
ID 1.00 1.00 1.00 679
LOCATION 0.91 0.90 0.90 1683
NAME 0.93 0.96 0.94 2916
PROFESSION 0.87 0.85 0.86 161
micro-avg 0.94 0.94 0.94 11509
macro-avg 0.94 0.94 0.94 11509
weighted-avg 0.95 0.94 0.94 11509
```
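The macro average is the unweighted mean of the per-label F1 scores, while the weighted average weights each label by its support. Both can be re-derived from the per-label rows as a sanity check (a sketch with the values copied from the table above):

```python
# Per-label F1 and support, copied from the benchmarking table above.
scores = {
    "AGE":        (0.96, 1186),
    "CONTACT":    (0.98, 366),
    "DATE":       (0.94, 4518),
    "ID":         (1.00, 679),
    "LOCATION":   (0.90, 1683),
    "NAME":       (0.94, 2916),
    "PROFESSION": (0.86, 161),
}

total = sum(s for _, s in scores.values())                      # 11509
macro = sum(f1 for f1, _ in scores.values()) / len(scores)      # unweighted mean
weighted = sum(f1 * s for f1, s in scores.values()) / total     # support-weighted

print(f"macro-avg f1: {macro:.2f}, weighted-avg f1: {weighted:.2f}")
```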
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from describeai)
author: John Snow Labs
name: t5_gemini_small
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gemini-small` is an English model originally trained by `describeai`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_gemini_small_en_4.3.0_3.0_1675102559187.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_gemini_small_en_4.3.0_3.0_1675102559187.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_gemini_small","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_gemini_small","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_gemini_small|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|916.0 MB|
## References
- https://huggingface.co/describeai/gemini-small
- https://www.describe-ai.com/gemini
---
layout: model
title: Fast Neural Machine Translation Model from Afrikaans to English
author: John Snow Labs
name: opus_mt_af_en
date: 2021-06-01
tags: [open_source, seq2seq, translation, af, en, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
source languages: af
target languages: en
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_af_en_xx_3.1.0_2.4_1622558064281.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_af_en_xx_3.1.0_2.4_1622558064281.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_af_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_af_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Afrikaans.translate_to.English').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_af_en|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Japanese Bert Embeddings (Large)
author: John Snow Labs
name: bert_embeddings_bert_large_japanese_char_extended
date: 2022-04-11
tags: [bert, embeddings, ja, open_source]
task: Embeddings
language: ja
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-japanese-char-extended` is a Japanese model originally trained by `KoichiYasuoka`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_japanese_char_extended_ja_3.4.2_3.0_1649674799994.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_japanese_char_extended_ja_3.4.2_3.0_1649674799994.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_japanese_char_extended","ja") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_japanese_char_extended","ja")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("私はSpark NLPを愛しています").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ja.embed.bert_large_japanese_char_extended").predict("""私はSpark NLPを愛しています""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_large_japanese_char_extended|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ja|
|Size:|1.2 GB|
|Case sensitive:|true|
## References
- https://huggingface.co/KoichiYasuoka/bert-large-japanese-char-extended
---
layout: model
title: Javanese RoBERTa Embeddings (Small, IMDB Movie Review)
author: John Snow Labs
name: roberta_embeddings_javanese_roberta_small_imdb
date: 2022-04-14
tags: [roberta, embeddings, jv, open_source]
task: Embeddings
language: jv
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `javanese-roberta-small-imdb` is a Javanese model originally trained by `w11wo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_javanese_roberta_small_imdb_jv_3.4.2_3.0_1649948176711.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_javanese_roberta_small_imdb_jv_3.4.2_3.0_1649948176711.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_javanese_roberta_small_imdb","jv") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_javanese_roberta_small_imdb","jv")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("jv.embed.javanese_roberta_small_imdb").predict("""I love Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_javanese_roberta_small_imdb|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|jv|
|Size:|468.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/w11wo/javanese-roberta-small-imdb
- https://arxiv.org/abs/1907.11692
- https://github.com/sgugger
- https://w11wo.github.io/
---
layout: model
title: Embeddings Clinical (Medium)
author: John Snow Labs
name: embeddings_clinical_medium
date: 2023-04-07
tags: [licensed, en, clinical, embeddings]
task: Embeddings
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.0
supported: true
annotator: WordEmbeddingsModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained on a list of clinical and biomedical datasets curated in-house, using the word2vec algorithm. The dataset curation cut-off date is March 2023, so the model is expected to generalize better on recent content. The model is around 1 GB in size and produces 200-dimensional embeddings. Our benchmark tests indicate that our legacy clinical embeddings (embeddings_clinical) can be replaced with this one when training a new model (existing/previous models will still need to use the legacy embeddings they were trained with).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_clinical_medium_en_4.3.2_3.0_1680835759101.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_clinical_medium_en_4.3.2_3.0_1680835759101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium","en","clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("word_embeddings")
```
```scala
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium","en","clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("word_embeddings")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|embeddings_clinical_medium|
|Type:|embeddings|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[word_embeddings]|
|Language:|en|
|Size:|787.5 MB|
|Case sensitive:|true|
|Dimension:|200|
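Each token is mapped to a 200-dimensional vector, and downstream components typically compare tokens via the cosine similarity of these vectors. A minimal sketch of that comparison (toy vectors stand in for real clinical embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 200-dimensional vectors standing in for real embeddings.
v1 = [0.1] * 200
v2 = [0.1] * 100 + [-0.1] * 100

print(round(cosine_similarity(v1, v1), 4))  # 1.0  (identical vectors)
print(round(cosine_similarity(v1, v2), 4))  # 0.0  (orthogonal vectors)
```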
---
layout: model
title: Swedish asr_wav2vec2_large_xlsr_swedish TFWav2Vec2ForCTC from marma
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_swedish
date: 2022-09-25
tags: [wav2vec2, sv, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: sv
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_xlsr_swedish` is a Swedish model originally trained by marma.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_swedish_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_swedish_sv_4.2.0_3.0_1664118734633.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_swedish_sv_4.2.0_3.0_1664118734633.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_swedish', lang = 'sv')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_swedish", lang = "sv")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_swedish|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|sv|
|Size:|756.1 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Drug Substance to UMLS Code Pipeline
author: John Snow Labs
name: umls_drug_substance_resolver_pipeline
date: 2023-04-11
tags: [licensed, clinical, en, umls, pipeline, drug, substance]
task: Chunk Mapping
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline maps entities (Drug Substances) to their corresponding UMLS CUI codes. You can simply feed in your text, and it will return the corresponding UMLS codes.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_drug_substance_resolver_pipeline_en_4.3.2_3.0_1681217098344.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_drug_substance_resolver_pipeline_en_4.3.2_3.0_1681217098344.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("umls_drug_substance_resolver_pipeline", "en", "clinical/models")
result = pipeline.annotate("The patient was given metformin, lenvatinib and Magnesium hydroxide 100mg/1ml")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = PretrainedPipeline("umls_drug_substance_resolver_pipeline", "en", "clinical/models")
val result = pipeline.annotate("The patient was given metformin, lenvatinib and Magnesium hydroxide 100mg/1ml")
```
## Results
```bash
+-----------------------------+---------+---------+
|chunk |ner_label|umls_code|
+-----------------------------+---------+---------+
|metformin |DRUG |C0025598 |
|lenvatinib |DRUG |C2986924 |
|Magnesium hydroxide 100mg/1ml|DRUG |C1134402 |
+-----------------------------+---------+---------+
```
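The chunk and code columns above can be zipped into a simple drug-to-CUI lookup. A minimal sketch using the values copied from the table (in practice these lists would come from the pipeline result):

```python
# Chunk and code columns copied from the results table above.
chunks = ["metformin", "lenvatinib", "Magnesium hydroxide 100mg/1ml"]
umls_codes = ["C0025598", "C2986924", "C1134402"]

# Build a lookup from detected drug substance to its UMLS CUI.
drug_to_cui = dict(zip(chunks, umls_codes))
print(drug_to_cui["lenvatinib"])  # C2986924
```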
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|umls_drug_substance_resolver_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|5.1 GB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- ChunkMapperModel
- ChunkMapperModel
- ChunkMapperFilterer
- Chunk2Doc
- BertSentenceEmbeddings
- SentenceEntityResolverModel
- ResolverMerger
---
layout: model
title: Multilingual DistilBertForQuestionAnswering Base Cased model (from monakth)
author: John Snow Labs
name: distilbert_qa_monakth_base_case_finetuned_squad
date: 2023-01-03
tags: [xx, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: xx
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-multilingual-cased-finetuned-squad` is a Multilingual model originally trained by `monakth`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_monakth_base_case_finetuned_squad_xx_4.3.0_3.0_1672767194965.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_monakth_base_case_finetuned_squad_xx_4.3.0_3.0_1672767194965.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_monakth_base_case_finetuned_squad","xx")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_monakth_base_case_finetuned_squad","xx")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_monakth_base_case_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|xx|
|Size:|505.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/monakth/distilbert-base-multilingual-cased-finetuned-squad
---
layout: model
title: Pipeline to Extract Pharmacological Entities From Spanish Medical Texts (BertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_pharmacology_pipeline
date: 2023-03-20
tags: [es, clinical, licensed, token_classification, bert, ner, pharmacology]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [bert_token_classifier_pharmacology](https://nlp.johnsnowlabs.com/2022/08/11/bert_token_classifier_pharmacology_es_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_pharmacology_pipeline_es_4.3.0_3.2_1679298404485.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_pharmacology_pipeline_es_4.3.0_3.2_1679298404485.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_pharmacology_pipeline", "es", "clinical/models")
text = '''Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa).'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_pharmacology_pipeline", "es", "clinical/models")
val text = "Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa)."
val result = pipeline.fullAnnotate(text)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("xlnet_large_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")
```
```scala
val pipeline = new PretrainedPipeline("xlnet_large_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")
```
## Results
```bash
+--------------+---------+
|chunk |ner_label|
+--------------+---------+
|John |PERSON |
|John Snow Labs|ORG |
+--------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlnet_large_token_classifier_conll03_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|32.1 KB|
## Included Models
- DocumentAssembler
- TokenizerModel
- NormalizerModel
---
layout: model
title: Detect PHI for Deidentification purposes (Spanish, Roberta, augmented)
author: John Snow Labs
name: ner_deid_subentity_roberta_augmented
date: 2022-02-15
tags: [deid, es, licensed]
task: De-identification
language: es
edition: Healthcare NLP 3.3.4
spark_version: 2.4
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Named Entity Recognition annotators allow a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 17 entities, which is more than the previously released `ner_deid_subentity_roberta` model.
This NER model is trained with a combination of custom datasets, Spanish 2002 conLL, MeddoProf and MeddoCan datasets, and includes several data augmentation mechanisms.
This version uses RoBERTa Clinical embeddings. A `ner_deid_subentity_augmented` model that uses Sciwi 300d embeddings instead of RoBERTa is also available.
## Predicted Entities
`PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `ID`, `STREET`, `USERNAME`, `SEX`, `EMAIL`, `ZIP`, `MEDICALRECORD`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_roberta_augmented_es_3.3.4_2.4_1644927666923.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_roberta_augmented_es_3.3.4_2.4_1644927666923.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = medical.NerModel.pretrained("ner_deid_subentity_roberta_augmented", "es", "clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
roberta_embeddings,
clinical_ner])
text = ['''
Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.
''']
df = spark.createDataFrame([text]).toDF("text")
results = nlpPipeline.fit(df).transform(df)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_roberta_augmented", "es", "clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, roberta_embeddings, clinical_ner))
val text = "Antonio Miguel Martínez, varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos."
val df = Seq(text).toDF("text")
val results = pipeline.fit(df).transform(df)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.med_ner.deid.subentity.roberta").predict("""
Antonio Miguel Martínez, varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.
""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_docket_language_model","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_docket_language_model","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|deberta_embeddings_docket_language_model|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|1.6 GB|
|Case sensitive:|false|
## References
https://huggingface.co/scales-okn/docket-language-model
---
layout: model
title: Detect Person, Location, Organization, and Miscellaneous entities in Arabic (ANERcorp)
author: John Snow Labs
name: aner_cc_300d
date: 2022-07-26
tags: [ner, ar, open_source]
task: Named Entity Recognition
language: ar
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: NerDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model uses Arabic word embeddings to find 4 different types of entities in Arabic text. It is trained using `arabic_w2v_cc_300d` word embeddings, so please use the same embeddings in the pipeline.
## Predicted Entities
`PER`, `LOC`, `ORG`, `MIS`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/aner_cc_300d_ar_4.0.0_3.0_1658869537384.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/aner_cc_300d_ar_4.0.0_3.0_1658869537384.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
ner = NerDLModel.pretrained("aner_cc_300d", "ar") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate("في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز")
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("aner_cc_300d", "ar")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.ner").predict("""في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز""")
```
## Results
```bash
| | ner_chunk | entity |
|---:|-------------------------:|-------------:|
| 0 | قوات الثورة العربية | ORG |
| 1 | دمشق | LOC |
| 2 | الإنكليز | PER |
```
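`fullAnnotate` returns plain Python structures keyed by output column, so the chunk/entity pairs above can be collected without Spark SQL. A minimal sketch, using a stand-in `Annotation` class in place of the real Spark NLP one (which exposes the same `result` and `metadata` fields):

```python
from dataclasses import dataclass, field

# Stand-in for a Spark NLP Annotation; real ones come from fullAnnotate.
@dataclass
class Annotation:
    result: str
    metadata: dict = field(default_factory=dict)

def chunks_with_labels(annotations):
    """Pair each ner_chunk's text with its entity label."""
    return [(a.result, a.metadata.get("entity"))
            for a in annotations[0]["ner_chunk"]]

# Shaped like the output of light_pipeline.fullAnnotate(...) above.
annotations = [{"ner_chunk": [
    Annotation("قوات الثورة العربية", {"entity": "ORG"}),
    Annotation("دمشق", {"entity": "LOC"}),
    Annotation("الإنكليز", {"entity": "PER"}),
]}]
print(chunks_with_labels(annotations))
```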
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|aner_cc_300d|
|Type:|ner|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token, word_embeddings]|
|Output Labels:|[ner]|
|Language:|ar|
|Size:|14.9 MB|
|Dependencies:|arabic_w2v_cc_300d|
## References
This model is trained on data obtained from http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp
## Benchmarking
```bash
label tp fp fn prec rec f1
B-LOC 163 28 34 0.853403 0.827411 0.840206
I-ORG 60 10 5 0.857142 0.923077 0.888889
I-MIS 124 53 53 0.700565 0.700565 0.700565
I-LOC 64 20 23 0.761904 0.735632 0.748538
B-MIS 297 71 52 0.807065 0.851003 0.828452
I-PER 84 23 13 0.785046 0.865979 0.823530
B-ORG 54 9 12 0.857142 0.818181 0.837210
B-PER 182 26 33 0.875 0.846512 0.860520
Macro-average 1028 240 225 0.812159 0.821045 0.816578
Micro-average 1028 240 225 0.810726 0.820431 0.815550
```
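The micro and macro averages above follow the standard definitions (pool the counts vs. average the per-label scores); a quick sketch that reproduces them from the per-label counts in the table:

```python
# Per-label (tp, fp, fn) counts copied from the benchmark table above.
counts = {
    "B-LOC": (163, 28, 34), "I-ORG": (60, 10, 5),
    "I-MIS": (124, 53, 53), "I-LOC": (64, 20, 23),
    "B-MIS": (297, 71, 52), "I-PER": (84, 23, 13),
    "B-ORG": (54, 9, 12), "B-PER": (182, 26, 33),
}

def prf(tp, fp, fn):
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return prec, rec, 2 * prec * rec / (prec + rec)

# Micro-average: pool the counts across labels, then score once.
micro = prf(sum(t for t, _, _ in counts.values()),
            sum(f for _, f, _ in counts.values()),
            sum(f for _, _, f in counts.values()))

# Macro-average: score each label, then take the unweighted mean.
per_label = [prf(*c) for c in counts.values()]
macro = tuple(sum(col) / len(per_label) for col in zip(*per_label))

print(round(micro[0], 6), round(micro[1], 6))  # 0.810726 0.820431
print(round(macro[0], 6), round(macro[1], 6))  # 0.812159 0.821045
```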
---
layout: model
title: English DistilBertForQuestionAnswering model (from hark99)
author: John Snow Labs
name: distilbert_qa_hark99_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `hark99`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hark99_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725361304.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hark99_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725361304.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hark99_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hark99_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_hark99").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
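As the snippet above shows, nlu question-answering loaders take the question and context in a single string separated by `|||`. A tiny helper (the `qa_input` name is ours, not an nlu API) makes the convention explicit:

```python
def qa_input(question: str, context: str) -> str:
    """Join a question and its context with nlu's '|||' separator."""
    return f"{question}|||{context}"

print(qa_input("What is my name?", "My name is Clara and I live in Berkeley."))
# What is my name?|||My name is Clara and I live in Berkeley.
```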
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_hark99_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/hark99/distilbert-base-uncased-finetuned-squad
---
layout: model
title: French CamemBert Embeddings (from peterhsu)
author: John Snow Labs
name: camembert_embeddings_peterhsu_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `peterhsu`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_peterhsu_generic_model_fr_3.4.4_3.0_1653989980368.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_peterhsu_generic_model_fr_3.4.4_3.0_1653989980368.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_peterhsu_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_peterhsu_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_peterhsu_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/peterhsu/dummy-model
---
layout: model
title: Recognize Entities OntoNotes - ELECTRA Base
author: John Snow Labs
name: onto_recognize_entities_electra_base
date: 2020-12-09
task: [Named Entity Recognition, Sentence Detection, Embeddings, Pipeline Public]
language: en
nav_key: models
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [en, open_source, pipeline]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A pre-trained pipeline containing a NerDLModel. The NER model was trained on OntoNotes 5.0 with `electra_base_uncased` embeddings. It can extract the following 18 entities:
## Predicted Entities
`CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_electra_base_en_2.7.0_2.4_1607511462448.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_electra_base_en_2.7.0_2.4_1607511462448.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('onto_recognize_entities_electra_base')
result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("onto_recognize_entities_electra_base")
val result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.")
```
{:.nlu-block}
```python
import nlu
text = ["""Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament."""]
ner_df = nlu.load('en.ner.onto.electra.base').predict(text, output_level='chunk')
ner_df[["entities", "entities_class"]]
```
{:.h2_title}
## Results
```bash
+------------+---------+
|chunk |ner_label|
+------------+---------+
|Johnson |PERSON |
|first |ORDINAL |
|2001 |DATE |
|Parliament |ORG |
|eight years |DATE |
|London |GPE |
|2008 to 2016|DATE |
|Parliament |ORG |
+------------+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|onto_recognize_entities_electra_base|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|en|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- Tokenizer
- BertEmbeddings
- NerDLModel
- NerConverter
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_nl36
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-nl36` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl36_en_4.3.0_3.0_1675122498742.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl36_en_4.3.0_3.0_1675122498742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_small_nl36","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_small_nl36","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_nl36|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|572.2 MB|
## References
- https://huggingface.co/google/t5-efficient-small-nl36
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from minhdang241)
author: John Snow Labs
name: distilbert_qa_robust_tapt
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `robustqa-tapt` is an English model originally trained by `minhdang241`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_robust_tapt_en_4.3.0_3.0_1672775384901.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_robust_tapt_en_4.3.0_3.0_1672775384901.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_robust_tapt","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_robust_tapt","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_robust_tapt|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/minhdang241/robustqa-tapt
---
layout: model
title: Relation Extraction between different oncological entity types using granular classes (ReDL)
author: John Snow Labs
name: redl_oncology_granular_biobert_wip
date: 2023-01-15
tags: [licensed, clinical, oncology, en, relation_extraction, temporal, test, biomarker, anatomy, tensorflow]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This relation extraction model identifies four relation types: is_date_of (between date entities and other clinical entities), is_size_of (between Tumor_Finding and Tumor_Size), is_location_of (between anatomical entities and other entities) and is_finding_of (between test entities and their results).
## Predicted Entities
`is_date_of`, `is_finding_of`, `is_location_of`, `is_size_of`, `O`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_granular_biobert_wip_en_4.2.4_3.0_1673768709402.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_granular_biobert_wip_en_4.2.4_3.0_1673768709402.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
Use relation pairs to include only the combinations of entities that are relevant in your case.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos_tags")
dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \
.setInputCols(["sentence", "pos_tags", "token"]) \
.setOutputCol("dependencies")
re_ner_chunk_filter = RENerChunksFilter()\
.setInputCols(["ner_chunk", "dependencies"])\
.setOutputCol("re_ner_chunk")\
.setMaxSyntacticDistance(10)\
.setRelationPairs(["Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery"])
re_model = RelationExtractionDLModel.pretrained("redl_oncology_granular_biobert_wip", "en", "clinical/models")\
.setInputCols(["re_ner_chunk", "sentence"])\
.setOutputCol("relation_extraction")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
pos_tagger,
dependency_parser,
re_ner_chunk_filter,
re_model])
data = spark.createDataFrame([["A mastectomy was performed two months ago, and a 3 cm mass was extracted."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos_tags")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentence", "pos_tags", "token"))
.setOutputCol("dependencies")
val re_ner_chunk_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunk", "dependencies"))
.setOutputCol("re_ner_chunk")
.setMaxSyntacticDistance(10)
.setRelationPairs(Array("Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery"))
val re_model = RelationExtractionDLModel.pretrained("redl_oncology_granular_biobert_wip", "en", "clinical/models")
.setPredictionThreshold(0.5f)
.setInputCols(Array("re_ner_chunk", "sentence"))
.setOutputCol("relation_extraction")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
pos_tagger,
dependency_parser,
re_ner_chunk_filter,
re_model))
val data = Seq("A mastectomy was performed two months ago, and a 3 cm mass was extracted.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.oncology_granular_biobert_wip").predict("""A mastectomy was performed two months ago, and a 3 cm mass was extracted.""")
```
## Results
```bash
+----------+--------------+-------------+-----------+----------+-------------+-------------+-----------+--------------+----------+
| relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence|
+----------+--------------+-------------+-----------+----------+-------------+-------------+-----------+--------------+----------+
|is_date_of|Cancer_Surgery| 2| 11|mastectomy|Relative_Date| 27| 40|two months ago| 0.9652523|
|is_size_of| Tumor_Size| 49| 52| 3 cm|Tumor_Finding| 54| 57| mass|0.81723577|
+----------+--------------+-------------+-----------+----------+-------------+-------------+-----------+--------------+----------+
```
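The Scala example sets `setPredictionThreshold(0.5f)` on the model; the same cut can also be applied after collecting the rows. A sketch over plain tuples shaped like the rows above (the `above_threshold` helper is ours, not a Spark NLP API):

```python
# (relation, chunk1, chunk2, confidence) rows, as in the table above.
rows = [
    ("is_date_of", "mastectomy", "two months ago", 0.9652523),
    ("is_size_of", "3 cm", "mass", 0.81723577),
]

def above_threshold(relations, threshold=0.5):
    """Keep only relations whose confidence meets the threshold."""
    return [r for r in relations if r[-1] >= threshold]

print(above_threshold(rows, threshold=0.9))
# Only the is_date_of relation survives a 0.9 threshold.
```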
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|redl_oncology_granular_biobert_wip|
|Compatibility:|Healthcare NLP 4.2.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|401.7 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label recall precision f1
O 0.83 0.91 0.87
is_date_of 0.82 0.80 0.81
is_finding_of 0.92 0.85 0.88
is_location_of 0.95 0.85 0.90
is_size_of 0.91 0.80 0.85
macro-avg 0.89 0.84 0.86
```
---
layout: model
title: Detect Entities Related to Cancer Diagnosis
author: John Snow Labs
name: ner_oncology_diagnosis
date: 2022-11-24
tags: [licensed, clinical, en, ner, oncology]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts entities related to cancer diagnosis, such as Metastasis, Histological_Type or Invasion.
Definitions of Predicted Entities:
- `Adenopathy`: Mentions of pathological findings of the lymph nodes.
- `Cancer_Dx`: Mentions of cancer diagnoses (such as "breast cancer") or pathological types that are usually used as synonyms for "cancer" (e.g. "carcinoma"). When anatomical references are present, they are included in the Cancer_Dx extraction.
- `Cancer_Score`: Clinical or imaging scores that are specific for cancer settings (e.g. "BI-RADS" or "Allred score").
- `Grade`: All pathological grading of tumors (e.g. "grade 1") or degrees of cellular differentiation (e.g. "well-differentiated").
- `Histological_Type`: Histological variants or cancer subtypes, such as "papillary", "clear cell" or "medullary".
- `Invasion`: Mentions that refer to tumor invasion, such as "invasion" or "involvement". Metastases or lymph node involvement are excluded from this category.
- `Metastasis`: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions.
- `Pathology_Result`: The findings of a biopsy from the pathology report that is not covered by another entity (e.g. "malignant ductal cells").
- `Performance_Status`: Mentions of performance status scores, such as ECOG and Karnofsky. The name of the score is extracted together with the result (e.g. "ECOG performance status of 4").
- `Staging`: Mentions of cancer stage such as "stage 2b" or "T2N1M0". It also includes words such as "in situ", "early-stage" or "advanced".
- `Tumor_Finding`: All nonspecific terms that may be related to tumors, either malignant or benign (for example: "mass", "tumor", "lesion", or "neoplasm").
- `Tumor_Size`: Size of the tumor, including numerical value and unit of measurement (e.g. "3 cm").
## Predicted Entities
`Adenopathy`, `Cancer_Dx`, `Cancer_Score`, `Grade`, `Histological_Type`, `Invasion`, `Metastasis`, `Pathology_Result`, `Performance_Status`, `Staging`, `Tumor_Finding`, `Tumor_Size`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_diagnosis_en_4.2.2_3.0_1669300474926.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_diagnosis_en_4.2.2_3.0_1669300474926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_diagnosis", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma. Last week she was also found to have a lung metastasis."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_diagnosis", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("""Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma.
Last week she was also found to have a lung metastasis.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_diagnosis").predict("""Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma.
Last week she was also found to have a lung metastasis.""")
```
## Results
```bash
| chunk | ner_label |
|:-------------|:------------------|
| tumor | Tumor_Finding |
| adenopathies | Adenopathy |
| invasive | Histological_Type |
| ductal | Histological_Type |
| carcinoma | Cancer_Dx |
| metastasis | Metastasis |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_diagnosis|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|34.3 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Histological_Type 354 63 99 453 0.85 0.78 0.81
Staging 234 27 24 258 0.90 0.91 0.90
Cancer_Score 36 15 26 62 0.71 0.58 0.64
Tumor_Finding 1121 83 136 1257 0.93 0.89 0.91
Invasion 154 27 27 181 0.85 0.85 0.85
Tumor_Size 1058 126 71 1129 0.89 0.94 0.91
Adenopathy 66 10 30 96 0.87 0.69 0.77
Performance_Status 116 15 19 135 0.89 0.86 0.87
Pathology_Result 852 686 290 1142 0.55 0.75 0.64
Metastasis 356 15 14 370 0.96 0.96 0.96
Cancer_Dx 1302 88 92 1394 0.94 0.93 0.94
Grade 201 23 35 236 0.90 0.85 0.87
macro_avg 5850 1178 863 6713 0.85 0.83 0.84
micro_avg 5850 1178 863 6713 0.85 0.87 0.86
```
---
layout: model
title: Part of Speech for Chinese
author: John Snow Labs
name: pos_ud_gsd_trad
date: 2021-03-09
tags: [part_of_speech, open_source, chinese, pos_ud_gsd_trad, zh]
task: Part of Speech Tagging
language: zh
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: PerceptronModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`.
## Predicted Entities
- AUX
- ADJ
- PUNCT
- ADV
- VERB
- NUM
- NOUN
- PRON
- PART
- ADP
- DET
- CCONJ
- PROPN
- X
- SYM
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_trad_zh_3.0.0_3.0_1615292436582.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_trad_zh_3.0.0_3.0_1615292436582.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ud_gsd_trad", "zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
pos
])
example = spark.createDataFrame([['从John Snow Labs你好! ']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_gsd_trad", "zh")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("从John Snow Labs你好! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""从John Snow Labs你好! """]
token_df = nlu.load('zh.pos.ud_gsd_trad').predict(text)
token_df
```
## Results
```bash
token pos
0 从 PROPN
1 JohnSnowLabs X
2 你 PRON
3 好 ADJ
4 ! PUNCT
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_gsd_trad|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|zh|
---
layout: model
title: English DistilBertForQuestionAnswering model (from Ayoola)
author: John Snow Labs
name: distilbert_qa_Ayoola_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Ayoola`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Ayoola_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724089038.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Ayoola_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724089038.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Ayoola_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Ayoola_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Ayoola").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_Ayoola_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/Ayoola/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Swedish BERT Base Cased Embedding
author: John Snow Labs
name: bert_base_cased
date: 2021-09-07
tags: [open_source, bert_embeddings, swedish, cased, sv]
task: Embeddings
language: sv
edition: Spark NLP 3.2.2
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20 GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums), aiming to provide a representative BERT model for Swedish text.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_cased_sv_3.2.2_3.0_1630999671555.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_cased_sv_3.2.2_3.0_1630999671555.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
# document_assembler, sentence_detector and tokenizer are the standard upstream
# stages (DocumentAssembler, SentenceDetector, Tokenizer), omitted here for brevity
embeddings = BertEmbeddings.pretrained("bert_base_cased", "sv") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
```
```scala
val embeddings = BertEmbeddings.pretrained("bert_base_cased", "sv")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
```
{:.nlu-block}
```python
import nlu
nlu.load("sv.embed.bert.base_cased").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_base_cased|
|Compatibility:|Spark NLP 3.2.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|sv|
|Case sensitive:|true|
## Data Source
The model is imported from: https://huggingface.co/KB/bert-base-swedish-cased
---
layout: model
title: Legal No violation Clause Binary Classifier
author: John Snow Labs
name: legclf_no_violation_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `no-violation` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences rather than the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have big legal documents, and you want to look for clauses, we recommend you to split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
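The paragraph-splitting option above (splitting by multiline) can be sketched in plain Python, independently of the Spark NLP annotators; the function name `split_paragraphs` is ours, for illustration only:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on one or more blank lines."""
    # A blank line is a newline followed by optional whitespace and another newline
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause 1. No violation of covenants...\n\nClause 2. Governing law..."
print(split_paragraphs(doc))
```

Each resulting paragraph can then be fed to the classifier as an independent document.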
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `no-violation`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_violation_clause_en_1.0.0_3.2_1660122713710.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_violation_clause_en_1.0.0_3.2_1660122713710.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
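The usage snippet is missing on this card. A sketch following the pattern of other `legclf_*` clause classifiers (sentence embeddings in, `category` out); the `tfhub_use` embeddings model and the `ClassifierDLModel` class are assumptions based on similar models, not confirmed by this card:

```python
# Assumed pipeline: DocumentAssembler -> sentence embeddings -> clause classifier.
# The embeddings stage and classifier class below are assumptions.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = ClassifierDLModel.pretrained("legclf_no_violation_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, doc_classifier])

data = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```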
## Results
```bash
+---------------+
|         result|
+---------------+
| [no-violation]|
|        [other]|
|        [other]|
| [no-violation]|
+---------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_no_violation_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|
## References
Legal documents, scraped from the Internet and classified in-house.
## Benchmarking
```bash
label precision recall f1-score support
no-violation 1.00 1.00 1.00 35
other 1.00 1.00 1.00 93
accuracy - - 1.00 128
macro-avg 1.00 1.00 1.00 128
weighted-avg 1.00 1.00 1.00 128
```
---
layout: model
title: Fast Neural Machine Translation Model from Lunda to English
author: John Snow Labs
name: opus_mt_lun_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, lun, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `lun`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_lun_en_xx_2.7.0_2.4_1609167008180.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_lun_en_xx_2.7.0_2.4_1609167008180.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_lun_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_lun_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.lun.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_lun_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Portuguese BertForMaskedLM Cased model (from pucpr)
author: John Snow Labs
name: bert_embeddings_biobertpt_all
date: 2022-12-02
tags: [pt, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: pt
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobertpt-all` is a Portuguese model originally trained by `pucpr`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_biobertpt_all_pt_4.2.4_3.0_1670020710320.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_biobertpt_all_pt_4.2.4_3.0_1670020710320.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_biobertpt_all","pt") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_biobertpt_all","pt")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_biobertpt_all|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|pt|
|Size:|667.6 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/pucpr/biobertpt-all
- https://www.aclweb.org/anthology/2020.clinicalnlp-1.7/
- https://www.aclweb.org/anthology/2020.clinicalnlp-1.7/
- https://github.com/HAILab-PUCPR/BioBERTpt
---
layout: model
title: Detect Entities Related to Cancer Therapies
author: John Snow Labs
name: ner_oncology_therapy
date: 2022-11-24
tags: [clinical, en, licensed, oncology, treatment, ner]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts entities related to oncology therapies using granular labels, including mentions of treatments, posology information and line of therapy.
Definitions of Predicted Entities:
- `Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment.
- `Chemotherapy`: Mentions of chemotherapy drugs, or unspecific words such as "chemotherapy".
- `Cycle_Count`: The total number of cycles being administered of an oncological therapy (e.g. "5 cycles").
- `Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. "day 5").
- `Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. "third cycle").
- `Dosage`: The quantity prescribed by the physician for an active ingredient.
- `Duration`: Words indicating the duration of a treatment (e.g. "for 2 weeks").
- `Frequency`: Words indicating the frequency of treatment administration (e.g. "daily" or "bid").
- `Hormonal_Therapy`: Mentions of hormonal drugs used to treat cancer, or unspecific words such as "hormonal therapy".
- `Immunotherapy`: Mentions of immunotherapy drugs, or unspecific words such as "immunotherapy".
- `Line_Of_Therapy`: Explicit references to the line of therapy of an oncological therapy (e.g. "first-line treatment").
- `Radiotherapy`: Terms that indicate the use of Radiotherapy.
- `Radiation_Dose`: Dose used in radiotherapy.
- `Response_To_Treatment`: Terms related to clinical progress of the patient related to cancer treatment, including "recurrence", "bad response" or "improvement".
- `Route`: Words indicating the type of administration route (such as "PO" or "transdermal").
- `Targeted_Therapy`: Mentions of targeted therapy drugs, or unspecific words such as "targeted therapy".
- `Unspecific_Therapy`: Terms that indicate a known cancer therapy but that is not specific to any other therapy entity (e.g. "chemoradiotherapy" or "adjuvant therapy").
## Predicted Entities
`Cancer_Surgery`, `Chemotherapy`, `Cycle_Count`, `Cycle_Day`, `Cycle_Number`, `Dosage`, `Duration`, `Frequency`, `Hormonal_Therapy`, `Immunotherapy`, `Line_Of_Therapy`, `Radiotherapy`, `Radiation_Dose`, `Response_To_Treatment`, `Route`, `Targeted_Therapy`, `Unspecific_Therapy`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_therapy_en_4.2.2_3.0_1669308088671.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_therapy_en_4.2.2_3.0_1669308088671.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_therapy", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["""The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago.
The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast.
The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_therapy", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("""The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago.
The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast.
The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_therapy").predict("""The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago.
The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast.
The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.""")
```
## Results
```bash
| chunk | ner_label |
|:-------------------------------|:----------------------|
| mastectomy | Cancer_Surgery |
| axillary lymph node dissection | Cancer_Surgery |
| radiotherapy | Radiotherapy |
| recurred | Response_To_Treatment |
| adriamycin | Chemotherapy |
| 60 mg/m2 | Dosage |
| cyclophosphamide | Chemotherapy |
| 600 mg/m2 | Dosage |
| six courses | Cycle_Count |
| first line | Line_Of_Therapy |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_therapy|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|34.4 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Cycle_Number 78 41 19 97 0.66 0.80 0.72
Response_To_Treatment 451 205 145 596 0.69 0.76 0.72
Cycle_Count 210 75 20 230 0.74 0.91 0.82
Unspecific_Therapy 189 76 89 278 0.71 0.68 0.70
Chemotherapy 831 87 48 879 0.91 0.95 0.92
Targeted_Therapy 194 28 34 228 0.87 0.85 0.86
Radiotherapy 279 35 31 310 0.89 0.90 0.89
Cancer_Surgery 720 192 99 819 0.79 0.88 0.83
Line_Of_Therapy 95 6 11 106 0.94 0.90 0.92
Hormonal_Therapy 170 6 15 185 0.97 0.92 0.94
Immunotherapy 96 17 32 128 0.85 0.75 0.80
Cycle_Day 205 38 43 248 0.84 0.83 0.84
Frequency 363 33 64 427 0.92 0.85 0.88
Route 93 6 20 113 0.94 0.82 0.88
Duration 527 102 234 761 0.84 0.69 0.76
Dosage 959 63 101 1060 0.94 0.90 0.92
Radiation_Dose 106 12 20 126 0.90 0.84 0.87
macro_avg 5566 1022 1025 6591 0.85 0.84 0.84
micro_avg 5566 1022 1025 6591 0.85 0.84 0.84
```
---
layout: model
title: Translate Setswana to English Pipeline
author: John Snow Labs
name: translate_tn_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, tn, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `tn`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_tn_en_xx_2.7.0_2.4_1609686822973.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_tn_en_xx_2.7.0_2.4_1609686822973.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_tn_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_tn_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.tn.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_tn_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English image_classifier_vit_base_patch16_224_in21k_aidSat ViTForImageClassification from YKXBCi
author: John Snow Labs
name: image_classifier_vit_base_patch16_224_in21k_aidSat
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224_in21k_aidSat` is an English model originally trained by YKXBCi.
## Predicted Entities
`Square`, `Farmland`, `BaseballField`, `Park`, `Commercial`, `Pond`, `Airport`, `SparseResidential`, `Church`, `School`, `Viaduct`, `Stadium`, `Desert`, `BareLand`, `MediumResidential`, `Center`, `Industrial`, `Playground`, `Port`, `DenseResidential`, `StorageTanks`, `Beach`, `Bridge`, `Mountain`, `River`, `Meadow`, `Resort`, `Parking`, `Forest`, `RailwayStation`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_aidSat_en_4.1.0_3.0_1660167644527.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_aidSat_en_4.1.0_3.0_1660167644527.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_base_patch16_224_in21k_aidSat", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_base_patch16_224_in21k_aidSat", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_base_patch16_224_in21k_aidSat|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|322.0 MB|
---
layout: model
title: Arabic Part of Speech Tagger (DA-Dialectal Arabic dataset, Modern Standard Arabic-MSA POS)
author: John Snow Labs
name: bert_pos_bert_base_arabic_camelbert_da_pos_msa
date: 2022-04-26
tags: [bert, pos, part_of_speech, ar, open_source]
task: Part of Speech Tagging
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-da-pos-msa` is an Arabic model originally trained by `CAMeL-Lab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_da_pos_msa_ar_3.4.2_3.0_1650993280099.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_da_pos_msa_ar_3.4.2_3.0_1650993280099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_da_pos_msa","ar") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_da_pos_msa","ar")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("أنا أحب الشرارة NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ar.pos.arabic_camelbert_da_pos_msa").predict("""أنا أحب الشرارة NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_pos_bert_base_arabic_camelbert_da_pos_msa|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|ar|
|Size:|407.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-da-pos-msa
- https://dl.acm.org/doi/pdf/10.5555/1621804.1621808
- https://arxiv.org/abs/2103.06678
- https://github.com/CAMeL-Lab/CAMeLBERT
- https://github.com/CAMeL-Lab/camel_tools
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from hamishm)
author: John Snow Labs
name: distilbert_qa_hamishm_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `hamishm`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hamishm_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771049910.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hamishm_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771049910.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hamishm_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hamishm_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_hamishm_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/hamishm/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Detect diseases in medical text (biobert)
author: John Snow Labs
name: ner_diseases_biobert
date: 2021-04-01
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Extract entities pertaining to different types of general diseases using pretrained NER model.
## Predicted Entities
`Disease`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DIAG_PROC/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diseases_biobert_en_3.0.0_3.0_1617260638998.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diseases_biobert_en_3.0.0_3.0_1617260638998.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_diseases_biobert", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_diseases_biobert", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))
val data = Seq("EXAMPLE_TEXT").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.diseases.biobert").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_diseases_biobert|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
---
layout: model
title: Czech asr_wav2vec2_xls_r_300m_250 TFWav2Vec2ForCTC from comodoro
author: John Snow Labs
name: pipeline_asr_wav2vec2_xls_r_300m_250
date: 2022-09-25
tags: [wav2vec2, cs, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: cs
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_250` is a Czech model originally trained by comodoro.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_250_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_250_cs_4.2.0_3.0_1664119400609.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_250_cs_4.2.0_3.0_1664119400609.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_250', lang = 'cs')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_250", lang = "cs")
val annotations = pipeline.transform(audioDF)
```
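The `audioDF` above is assumed to hold raw audio as arrays of floats. A minimal sketch of reading a 16 kHz mono WAV file into that representation using only the Python standard library (the file path is hypothetical):

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV file into a list of floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        raw = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]
```

The resulting list can then be placed in a single array column of a Spark DataFrame to build `audioDF` for the pipeline.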
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xls_r_300m_250|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|cs|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from linh101201)
author: John Snow Labs
name: roberta_qa_linh101201_base_finetuned_squad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad` is an English model originally trained by `linh101201`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_linh101201_base_finetuned_squad_en_4.3.0_3.0_1674217360099.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_linh101201_base_finetuned_squad_en_4.3.0_3.0_1674217360099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_linh101201_base_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_linh101201_base_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
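Under the hood, an extractive QA head scores each token as a possible answer start or end, and the answer is the highest-scoring valid span. A minimal pure-Python sketch of that span selection with hypothetical logits (illustrative only, not the actual Spark NLP implementation):

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Pick the (start, end) token pair maximising start + end score,
    with end >= start and a bounded answer length."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            if s + end_logits[j] > best_score:
                best_score = s + end_logits[j]
                best = (i, j)
    return best
```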
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_linh101201_base_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|424.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/linh101201/roberta-base-finetuned-squad
---
layout: model
title: English XLMRobertaForTokenClassification Large Uncased model (from asahi417)
author: John Snow Labs
name: xlmroberta_ner_tner_large_uncased_ontonotes5
date: 2022-08-13
tags: [en, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tner-xlm-roberta-large-uncased-ontonotes5` is an English model originally trained by `asahi417`.
## Predicted Entities
`language`, `time`, `percent`, `quantity`, `product`, `ordinal number`, `cardinal number`, `event`, `geopolitical area`, `facility`, `organization`, `work of art`, `group`, `money`, `law`, `person`, `location`, `date`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_large_uncased_ontonotes5_en_4.1.0_3.0_1660425481816.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_large_uncased_ontonotes5_en_4.1.0_3.0_1660425481816.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_large_uncased_ontonotes5","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_large_uncased_ontonotes5","en")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
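The `NerConverter` stage merges token-level BIO tags into entity chunks. A minimal pure-Python sketch of that merging logic (illustrative only, not the actual Spark NLP implementation):

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO tags into (entity_type, text) chunks."""
    chunks, cur_type, cur_toks = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur_toks:
                chunks.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = tag[2:], [tok]
        elif tag.startswith("I-") and cur_type == tag[2:]:
            cur_toks.append(tok)
        else:
            if cur_toks:
                chunks.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = None, []
    if cur_toks:
        chunks.append((cur_type, " ".join(cur_toks)))
    return chunks
```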
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_tner_large_uncased_ontonotes5|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|1.8 GB|
|Case sensitive:|false|
|Max sentence length:|256|
## References
- https://huggingface.co/asahi417/tner-xlm-roberta-large-uncased-ontonotes5
- https://github.com/asahi417/tner
---
layout: model
title: English asr_wav2vec2_xls_r_timit_tokenizer_base TFWav2Vec2ForCTC from hrdipto
author: John Snow Labs
name: pipeline_asr_wav2vec2_xls_r_timit_tokenizer_base
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_timit_tokenizer_base` is an English model originally trained by hrdipto.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_timit_tokenizer_base_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_timit_tokenizer_base_en_4.2.0_3.0_1664040322145.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_timit_tokenizer_base_en_4.2.0_3.0_1664040322145.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_timit_tokenizer_base', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_timit_tokenizer_base", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xls_r_timit_tokenizer_base|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|349.4 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
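The `Wav2Vec2ForCTC` stage emits per-frame character probabilities, which are turned into text by CTC decoding: collapse consecutive repeats, then drop blanks. A minimal sketch of greedy CTC decoding with hypothetical frame labels (illustrative only, not the actual Spark NLP implementation):

```python
def ctc_greedy_decode(frame_labels, blank="_"):
    """Collapse repeated frame labels and drop blanks (greedy CTC decoding)."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

print(ctc_greedy_decode(list("__hh_e_ll_ll__o_")))  # hello
```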
---
layout: model
title: Japanese Part of Speech Tagger (from KoichiYasuoka)
author: John Snow Labs
name: bert_pos_bert_base_japanese_upos
date: 2022-05-09
tags: [bert, pos, part_of_speech, ja, open_source]
task: Part of Speech Tagging
language: ja
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-japanese-upos` is a Japanese model originally trained by `KoichiYasuoka`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_japanese_upos_ja_3.4.2_3.0_1652091854179.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_japanese_upos_ja_3.4.2_3.0_1652091854179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_japanese_upos","ja") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["Spark NLPが大好きです"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_japanese_upos","ja")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("Spark NLPが大好きです").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_pos_bert_base_japanese_upos|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|ja|
|Size:|338.8 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/KoichiYasuoka/bert-base-japanese-upos
- https://universaldependencies.org/u/pos/
- https://github.com/KoichiYasuoka/esupar
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from vinitharaj)
author: John Snow Labs
name: distilbert_qa_vinitharaj_base_uncased_finetuned_squad2
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad2` is an English model originally trained by `vinitharaj`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_vinitharaj_base_uncased_finetuned_squad2_en_4.3.0_3.0_1672773632350.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_vinitharaj_base_uncased_finetuned_squad2_en_4.3.0_3.0_1672773632350.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vinitharaj_base_uncased_finetuned_squad2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vinitharaj_base_uncased_finetuned_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_vinitharaj_base_uncased_finetuned_squad2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/vinitharaj/distilbert-base-uncased-finetuned-squad2
---
layout: model
title: English BertForQuestionAnswering model (from AnonymousSub)
author: John Snow Labs
name: bert_qa_bert_FT_new_newsqa
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_FT_new_newsqa` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_FT_new_newsqa_en_4.0.0_3.0_1654185046001.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_FT_new_newsqa_en_4.0.0_3.0_1654185046001.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_FT_new_newsqa","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_FT_new_newsqa","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.news.bert.ft_new.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_FT_new_newsqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/AnonymousSub/bert_FT_new_newsqa
---
layout: model
title: Fast Neural Machine Translation Model from Luo (Kenya and Tanzania) to English
author: John Snow Labs
name: opus_mt_luo_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, luo, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `luo`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_luo_en_xx_2.7.0_2.4_1609167362702.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_luo_en_xx_2.7.0_2.4_1609167362702.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_luo_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_luo_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.luo.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_luo_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Vietnamese BertForQuestionAnswering model (from nvkha)
author: John Snow Labs
name: bert_qa_bert_qa_vi_nvkha
date: 2022-06-03
tags: [open_source, question_answering, bert]
task: Question Answering
language: vi
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-qa-vi` is a Vietnamese model originally trained by `nvkha`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_qa_vi_nvkha_vi_4.0.0_3.0_1654249815695.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_qa_vi_nvkha_vi_4.0.0_3.0_1654249815695.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_qa_vi_nvkha","vi") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_qa_vi_nvkha","vi")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("vi.answer_question.bert.by_nvkha").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_qa_vi_nvkha|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|vi|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/nvkha/bert-qa-vi
---
layout: model
title: Translate Tongan to English Pipeline
author: John Snow Labs
name: translate_to_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, to, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `to`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_to_en_xx_2.7.0_2.4_1609688214187.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_to_en_xx_2.7.0_2.4_1609688214187.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_to_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_to_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.to.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_to_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Summarize Clinical Notes in Layman Terms
author: John Snow Labs
name: summarizer_clinical_laymen
date: 2023-05-31
tags: [licensed, en, clinical, summarization, tensorflow]
task: Summarization
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalSummarizer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a modified version of a Flan-T5-based (LLM) summarization model, fine-tuned on a custom dataset by John Snow Labs to avoid clinical jargon in its summaries. It can generate summaries of up to 512 tokens given an input text of at most 1024 tokens.
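Because the input is capped at 1024 tokens, longer clinical notes are often split into windows that are summarized separately. A minimal windowing sketch in plain Python (whitespace tokens stand in for the model's real subword tokenizer, so counts are only approximate):

```python
def window_tokens(text, max_tokens=1024):
    """Split text into consecutive whitespace-token windows of at most max_tokens.

    Illustrative stand-in for the model's real tokenizer: whitespace splitting
    only approximates the true subword token count.
    """
    tokens = text.split()
    return [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]

# Example: a document of 2500 pseudo-tokens yields three windows.
doc = " ".join(f"tok{i}" for i in range(2500))
parts = window_tokens(doc)
print([len(p.split()) for p in parts])  # [1024, 1024, 452]
```

Each window can then be passed through the summarizer pipeline and the partial summaries concatenated or re-summarized.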
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_laymen_en_4.4.2_3.0_1685557633038.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_laymen_en_4.4.2_3.0_1685557633038.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
summarizer = MedicalSummarizer.pretrained("summarizer_clinical_laymen", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("summary")\
.setMaxNewTokens(512)
pipeline = Pipeline(stages=[
document_assembler,
summarizer
])
text ="""Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43. She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image. She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year. She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weight loss, and Dexatrim for one month in 2005 with a five-pound weight loss. She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss.\n\nPAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath.\n\nPAST SURGICAL HISTORY: Pertinent for cholecystectomy.\n\nPSYCHOLOGICAL HISTORY: Negative.\n\nSOCIAL HISTORY: She is single. She drinks alcohol once a week. She does not smoke.\n\nFAMILY HISTORY: Pertinent for obesity and hypertension.\n\nMEDICATIONS: Include Topamax 100 mg twice daily, Zoloft 100 mg twice daily, Abilify 5 mg daily, Motrin 800 mg daily, and a multivitamin.\n\nALLERGIES: She has no known drug allergies.\n\nREVIEW OF SYSTEMS: Negative.\n\nPHYSICAL EXAM: This is a pleasant female in no acute distress. Alert and oriented x 3. HEENT: Normocephalic, atraumatic. Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. 
Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis.\n\nASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy. Once this is completed, we will submit her to her insurance company for approval.
"""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val summarizer = MedicalSummarizer.pretrained("summarizer_clinical_laymen", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("summary")
.setMaxNewTokens(512)
val pipeline = new Pipeline().setStages(Array(document_assembler, summarizer))
val data = Seq("""Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43. She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image. She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year. She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weight loss, and Dexatrim for one month in 2005 with a five-pound weight loss. She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss.\n\nPAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath.\n\nPAST SURGICAL HISTORY: Pertinent for cholecystectomy.\n\nPSYCHOLOGICAL HISTORY: Negative.\n\nSOCIAL HISTORY: She is single. She drinks alcohol once a week. She does not smoke.\n\nFAMILY HISTORY: Pertinent for obesity and hypertension.\n\nMEDICATIONS: Include Topamax 100 mg twice daily, Zoloft 100 mg twice daily, Abilify 5 mg daily, Motrin 800 mg daily, and a multivitamin.\n\nALLERGIES: She has no known drug allergies.\n\nREVIEW OF SYSTEMS: Negative.\n\nPHYSICAL EXAM: This is a pleasant female in no acute distress. Alert and oriented x 3. HEENT: Normocephalic, atraumatic. Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. 
Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis.\n\nASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy. Once this is completed, we will submit her to her insurance company for approval.""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
['This is a clinical note about a 34-year-old woman who is interested in having weight loss surgery. She has been overweight for over 20 years and wants to have more energy and improve her self-image. She has tried many diets and weight loss programs, but has not been successful in keeping the weight off. She has a history of hypertension and shortness of breath, but is not allergic to any medications. She will have an upper endoscopy and will be contacted by a nutritionist and social worker. The plan is to have her weight loss surgery through the gastric bypass, rather than Lap-Band.']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|summarizer_clinical_laymen|
|Compatibility:|Healthcare NLP 4.4.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|920.5 MB|
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from danhsf)
author: John Snow Labs
name: xlmroberta_ner_danhsf_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `danhsf`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_danhsf_base_finetuned_panx_de_4.1.0_3.0_1660432036772.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_danhsf_base_finetuned_panx_de_4.1.0_3.0_1660432036772.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_danhsf_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_danhsf_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_danhsf_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/danhsf/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: Legal Investment Advisory Agreement Document Classifier (Longformer)
author: John Snow Labs
name: legclf_investment_advisory_agreement
date: 2022-12-06
tags: [en, legal, classification, agreement, investment, advisory, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The `legclf_investment_advisory_agreement` model is a Legal Longformer Document Classifier that determines whether a document belongs to the class `investment-advisory-agreement` or not (binary classification).
Longformers are limited to 4096 tokens, so only the first 4096 tokens are taken into account. We have found that for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document without extra leading material, 4096 tokens are enough for document classification.
If not, let us know and we can provide an alternative approach: splitting the document into 4096-token chunks and averaging their embeddings, then training on the averaged version, which means the whole document is taken into account. In practice, however, this should not be required.
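The chunk-and-average fallback described above can be sketched in plain Python. Here `embed_chunk` is a hypothetical stand-in for the Longformer encoder, not part of any real API:

```python
def chunk(tokens, size=4096):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def average_embeddings(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(vals) / n for vals in zip(*vectors)]

def document_embedding(tokens, embed_chunk, size=4096):
    """Embed each chunk and average, so the whole document contributes."""
    return average_embeddings([embed_chunk(c) for c in chunk(tokens, size)])

# Toy embedder: a 2-dim "embedding" of [chunk length, 1.0]. A 10000-token
# document yields chunks of 4096, 4096, and 1808 tokens.
toy = lambda c: [float(len(c)), 1.0]
emb = document_embedding(list(range(10000)), toy)
```

The averaged vector would then feed the classifier head in place of the single first-chunk embedding.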
## Predicted Entities
`investment-advisory-agreement`, `other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_investment_advisory_agreement_en_1.0.0_3.0_1670357942473.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_investment_advisory_agreement_en_1.0.0_3.0_1670357942473.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+----------+-------------------------------------------------------------------------------------------------------------------------------------------+
|filename |exploded_entities |
+----------+-------------------------------------------------------------------------------------------------------------------------------------------+
|test0.jpeg|{named_entity, 24, 24, UNITPRICE-B, {confidence -> 95, width -> 66, x -> 306, y -> 229, word -> #010029, token -> #, height -> 17}, []} |
|test0.jpeg|{named_entity, 32, 35, NAME-B, {confidence -> 91, width -> 38, x -> 200, y -> 250, word -> Sale, token -> sale, height -> 17}, []} |
|test0.jpeg|{named_entity, 37, 37, OTHERS, {confidence -> 91, width -> 8, x -> 249, y -> 253, word -> #, token -> #, height -> 15}, []} |
|test0.jpeg|{named_entity, 39, 47, NUM-B, {confidence -> 96, width -> 83, x -> 270, y -> 252, word -> 143710882, token -> 143710882, height -> 17}, []}|
|test0.jpeg|{named_entity, 49, 52, NAME-B, {confidence -> 96, width -> 37, x -> 191, y -> 274, word -> Team, token -> team, height -> 17}, []} |
|test0.jpeg|{named_entity, 66, 68, CNT-B, {confidence -> 88, width -> 28, x -> 82, y -> 296, word -> Jan, token -> jan, height -> 16}, []} |
|test0.jpeg|{named_entity, 114, 114, OTHERS, {confidence -> 63, width -> 27, x -> 229, y -> 323, word -> ***, token -> *, height -> 13}, []} |
+----------+-------------------------------------------------------------------------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|visualner_receipts|
|Type:|ocr|
|Compatibility:|Visual NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|xx|
|Size:|744.4 MB|
## References
CORD
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from autoevaluate)
author: John Snow Labs
name: distilbert_qa_extractive
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `extractive-question-answering` is an English model originally trained by `autoevaluate`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_extractive_en_4.3.0_3.0_1672775125688.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_extractive_en_4.3.0_3.0_1672775125688.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_extractive","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_extractive","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_extractive|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/autoevaluate/extractive-question-answering
---
layout: model
title: BERTje A Dutch BERT model
author: John Snow Labs
name: bert_base_dutch_cased
date: 2021-05-20
tags: [open_source, embeddings, bert, dutch, nl]
task: Embeddings
language: nl
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
BERTje is a Dutch pre-trained BERT model developed at the University of Groningen.
For details, check out our paper on [arXiv](https://arxiv.org/abs/1912.09582), the code on [Github](https://github.com/wietsedv/bertje) and related work on [Semantic Scholar](https://www.semanticscholar.org/paper/BERTje%3A-A-Dutch-BERT-Model-Vries-Cranenburgh/a4d5e425cac0bf84c86c0c9f720b6339d6288ffa).
The paper and Github page mention fine-tuned models that are available [here](https://huggingface.co/wietsedv).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_dutch_cased_nl_3.1.0_3.0_1621500934814.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_dutch_cased_nl_3.1.0_3.0_1621500934814.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
embeddings = BertEmbeddings.pretrained("bert_base_dutch_cased", "nl") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
```
```scala
val embeddings = BertEmbeddings.pretrained("bert_base_dutch_cased", "nl")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
```
{:.nlu-block}
```python
import nlu
nlu.load("nl.embed.bert").predict("""Put your text here.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_base_dutch_cased|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[embeddings]|
|Language:|nl|
|Case sensitive:|true|
## Data Source
https://huggingface.co/wietsedv/bert-base-dutch-cased
## Benchmarking
```bash
The arXiv paper lists benchmarks. Here are a couple of comparisons between BERTje, multilingual BERT, BERT-NL, and RobBERT that were done after writing the paper. Unlike some other comparisons, the fine-tuning procedures for these benchmarks are identical for each pre-trained model. You may be able to achieve higher scores for individual models by optimizing the fine-tuning procedures.
More experimental results will be added to this page when they are finished. Technical details about how we fine-tuned these models will be published later, as well as downloadable fine-tuned checkpoints.
All of the tested models are *base* sized (12 layers) with cased tokenization.
Headers in the tables below link to the original data sources. Scores link to the model pages for the corresponding fine-tuned models. These tables will be updated when more simple fine-tuned models are made available.
### Named Entity Recognition
| Model | [CoNLL-2002](https://www.clips.uantwerpen.be/conll2002/ner/) | [SoNaR-1](https://ivdnt.org/downloads/taalmaterialen/tstc-sonar-corpus) | spaCy UD LassySmall |
| ---------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
| **BERTje** | [**90.24**](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-conll2002-ner) | [**84.93**](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-sonar-ner) | [86.10](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-udlassy-ner) |
| [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) | [88.61](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-conll2002-ner) | [84.19](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-sonar-ner) | [**86.77**](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-udlassy-ner) |
| [BERT-NL](http://textdata.nl) | 85.05 | 80.45 | 81.62 |
| [RobBERT](https://github.com/iPieter/RobBERT) | 84.72 | 81.98 | 79.84 |
### Part-of-speech tagging
| Model | [UDv2.5 LassySmall](https://universaldependencies.org/treebanks/nl_lassysmall/index.html) |
| ---------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| **BERTje** | **96.48** |
| [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) | 96.20 |
| [BERT-NL](http://textdata.nl) | 96.10 |
| [RobBERT](https://github.com/iPieter/RobBERT) | 95.91 |
```
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from ParulChaudhari)
author: John Snow Labs
name: distilbert_qa_parulchaudhari_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `ParulChaudhari`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_parulchaudhari_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768887661.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_parulchaudhari_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768887661.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_parulchaudhari_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_parulchaudhari_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_parulchaudhari_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/ParulChaudhari/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Sentiment Analysis of German news
author: John Snow Labs
name: bert_sequence_classifier_news_sentiment
date: 2022-01-18
tags: [german, sentiment, bert_sequence, de, open_source]
task: Sentiment Analysis
language: de
edition: Spark NLP 3.3.4
spark_version: 3.0
supported: true
annotator: BertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model was imported from `Hugging Face` ([link](https://huggingface.co/mdraw/german-news-sentiment-bert)) and has been fine-tuned on German-language news texts about migration, leveraging `Bert` embeddings and `BertForSequenceClassification` for text classification purposes.
## Predicted Entities
`positive`, `negative`, `neutral`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_news_sentiment_de_3.3.4_3.0_1642504435983.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_news_sentiment_de_3.3.4_3.0_1642504435983.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
sequenceClassifier = BertForSequenceClassification \
.pretrained('bert_sequence_classifier_news_sentiment', 'de') \
.setInputCols(['token', 'document']) \
.setOutputCol('class')
pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])
example = spark.createDataFrame([['Die Zahl der Flüchtlinge in Deutschland steigt von Tag zu Tag.']]).toDF("text")
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_news_sentiment", "de")
.setInputCols("document", "token")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))
val example = Seq("Die Zahl der Flüchtlinge in Deutschland steigt von Tag zu Tag.").toDS.toDF("text")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.classify.news_sentiment.bert").predict("""Die Zahl der Flüchtlinge in Deutschland steigt von Tag zu Tag.""")
```
## Results
```bash
['neutral']
```
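Under the hood, `BertForSequenceClassification` scores each class and returns the highest-probability label, as in the `['neutral']` result above. The following is a minimal, framework-free sketch of that final step (softmax over raw scores, then argmax); the label order in `LABELS` is an illustrative assumption, not the model's actual class index mapping.

```python
import math

# Assumed label order for illustration only; the real model defines its own mapping.
LABELS = ["positive", "negative", "neutral"]

def predict_label(logits, labels=LABELS):
    """Softmax over raw class scores, then return the highest-probability label."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return labels[probs.index(max(probs))]

print(predict_label([0.2, 0.1, 1.9]))  # 'neutral'
```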
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_news_sentiment|
|Compatibility:|Spark NLP 3.3.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|de|
|Size:|408.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## Data Source
[https://wortschatz.uni-leipzig.de/en/download/German](https://wortschatz.uni-leipzig.de/en/download/German)
---
layout: model
title: English asr_wav2vec2_large_tedlium TFWav2Vec2ForCTC from sanchit-gandhi
author: John Snow Labs
name: asr_wav2vec2_large_tedlium
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_tedlium` is an English model originally trained by sanchit-gandhi.
NOTE: This model runs on CPU only. To run it on a GPU device, please use asr_wav2vec2_large_tedlium_gpu instead.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_tedlium_en_4.2.0_3.0_1664094417234.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_tedlium_en_4.2.0_3.0_1664094417234.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_large_tedlium", "en")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_large_tedlium", "en")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
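The `audio_content` column fed to `AudioAssembler` holds arrays of float samples rather than raw bytes. As a minimal sketch of preparing such input, the helper below converts 16-bit little-endian PCM bytes to floats in [-1.0, 1.0); the function name and the assumption of 16-bit PCM are illustrative, not part of the Spark NLP API.

```python
import struct

def pcm16_to_floats(pcm_bytes: bytes) -> list:
    """Convert 16-bit little-endian PCM samples to floats in [-1.0, 1.0)."""
    n = len(pcm_bytes) // 2
    samples = struct.unpack("<" + "h" * n, pcm_bytes[: 2 * n])
    return [s / 32768.0 for s in samples]

# Two samples: silence (0) and half of full scale (16384)
raw = struct.pack("<hh", 0, 16384)
print(pcm16_to_floats(raw))  # [0.0, 0.5]
```

A list of such float arrays can then be turned into the `audioDf` used above, e.g. with `spark.createDataFrame(...)` and a float-array column named `audio_content`.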
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_tedlium|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from ashhyun)
author: John Snow Labs
name: distilbert_qa_ashhyun_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `ashhyun`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_ashhyun_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770017698.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_ashhyun_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770017698.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ashhyun_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ashhyun_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_ashhyun_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/ashhyun/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English image_classifier_vit_beer_vs_wine ViTForImageClassification from filipafcastro
author: John Snow Labs
name: image_classifier_vit_beer_vs_wine
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_beer_vs_wine` is an English model originally trained by filipafcastro.
## Predicted Entities
`beer`, `wine`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_beer_vs_wine_en_4.1.0_3.0_1660166570749.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_beer_vs_wine_en_4.1.0_3.0_1660166570749.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_beer_vs_wine", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_beer_vs_wine", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_beer_vs_wine|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Fast Neural Machine Translation Model from Ndonga to English
author: John Snow Labs
name: opus_mt_ng_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, ng, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `ng`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ng_en_xx_2.7.0_2.4_1609169685419.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ng_en_xx_2.7.0_2.4_1609169685419.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_ng_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_ng_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.ng.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_ng_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English T5ForConditionalGeneration Cased model (from OnsElleuch)
author: John Snow Labs
name: t5_logisgenerator
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `logisgenerator` is an English model originally trained by `OnsElleuch`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_logisgenerator_en_4.3.0_3.0_1675104908400.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_logisgenerator_en_4.3.0_3.0_1675104908400.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_logisgenerator","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_logisgenerator","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_logisgenerator|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|280.2 MB|
## References
- https://huggingface.co/OnsElleuch/logisgenerator
- https://pypi.org/project/keytotext/
- https://pepy.tech/project/keytotext
- https://colab.research.google.com/github/gagan3012/keytotext/blob/master/notebooks/K2T.ipynb
- https://share.streamlit.io/gagan3012/keytotext/UI/app.py
- https://github.com/gagan3012/keytotext#api
- https://hub.docker.com/r/gagan30/keytotext
- https://keytotext.readthedocs.io/en/latest/?badge=latest
- https://github.com/psf/black
- https://socialify.git.ci/gagan3012/keytotext/image?description=1&forks=1&language=1&owner=1&stargazers=1&theme=Light
---
layout: model
title: Augment Company Names with NASDAQ database
author: John Snow Labs
name: finmapper_nasdaq_companyname
date: 2022-08-09
tags: [en, finance, companies, tickers, nasdaq, data, augmentation, licensed]
task: Chunk Mapping
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Given the extracted name of a company, this model augments it with information about that company, including its industry, sector, and trading symbol (ticker).
It can optionally be combined with Entity Resolution to normalize the company name first.
## Predicted Entities
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/FIN_LEG_COMPANY_AUGMENTATION/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmapper_nasdaq_companyname_en_1.0.0_3.2_1660038424307.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmapper_nasdaq_companyname_en_1.0.0_3.2_1660038424307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
tokenizer = nlp.Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained('finner_orgs_prods_alias', 'en', 'finance/models')\
.setInputCols(["document", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
# Optional: To normalize the ORG name using NASDAQ data before the mapping
##########################################################################
chunkToDoc = nlp.Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
chunk_embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use_lg", "en")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("chunk_embeddings")
use_er_model = finance.SentenceEntityResolverModel.pretrained('finel_nasdaq_data_company_name', 'en', 'finance/models')\
.setInputCols("chunk_embeddings")\
.setOutputCol('normalized')\
.setDistanceFunction("EUCLIDEAN")
##########################################################################
CM = finance.ChunkMapperModel()\
.pretrained('finmapper_nasdaq_companyname', 'en', 'finance/models')\
.setInputCols(["normalized"])\
.setOutputCol("mappings")  # pass "ner_chunk" instead of "normalized" to skip normalization
pipeline = nlp.Pipeline().setStages([document_assembler,
tokenizer,
embeddings,
ner_model,
ner_converter,
chunkToDoc, # Optional for normalization
chunk_embeddings, # Optional for normalization
use_er_model, # Optional for normalization
CM])
text = """Altaba Inc. is a company which ..."""
test_data = spark.createDataFrame([[text]]).toDF("text")
model = pipeline.fit(test_data)
lp = nlp.LightPipeline(model)
lp.fullAnnotate(text)
```
## Results
```bash
[Row(mappings=[Row(annotatorType='labeled_dependency', begin=0, end=10, result='AABA', metadata={'sentence': '0', 'chunk': '0', 'entity': 'Altaba Inc.', 'relation': 'ticker', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=10, result='Altaba Inc.', metadata={'sentence': '0', 'chunk': '0', 'entity': 'Altaba Inc.', 'relation': 'company_name', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=10, result='Altaba', metadata={'sentence': '0', 'chunk': '0', 'entity': 'Altaba Inc.', 'relation': 'short_name', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=10, result='Asset Management', metadata={'sentence': '0', 'chunk': '0', 'entity': 'Altaba Inc.', 'relation': 'industry', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=10, result='Financial Services', metadata={'sentence': '0', 'chunk': '0', 'entity': 'Altaba Inc.', 'relation': 'sector', 'all_relations': ''}, embeddings=[])])]
```
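Each mapping row above carries its relation name (`ticker`, `industry`, `sector`, ...) in its metadata. A minimal sketch of flattening those rows into one `{relation: value}` dict per entity follows; the dict-based row structure here just mimics the fields shown in the Results, while in a real pipeline the rows would come from `lp.fullAnnotate(text)`.

```python
def mappings_to_dict(rows):
    """Collapse ChunkMapper output rows into a {relation: result} dict."""
    out = {}
    for row in rows:
        out[row["metadata"]["relation"]] = row["result"]
    return out

# Illustrative rows shaped like the fullAnnotate output above
rows = [
    {"result": "AABA", "metadata": {"entity": "Altaba Inc.", "relation": "ticker"}},
    {"result": "Financial Services", "metadata": {"entity": "Altaba Inc.", "relation": "sector"}},
]
print(mappings_to_dict(rows))  # {'ticker': 'AABA', 'sector': 'Financial Services'}
```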
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finmapper_nasdaq_companyname|
|Type:|finance|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|210.5 KB|
## References
https://data.world/johnsnowlabs/list-of-companies-in-nasdaq-exchanges
---
layout: model
title: Stopwords Remover for Greek (modern) language (663 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, el, open_source]
task: Stop Words Removal
language: el
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_el_3.4.1_3.0_1646672928398.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_el_3.4.1_3.0_1646672928398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","el") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Δεν είστε καλύτεροι από μένα"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","el")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("Δεν είστε καλύτεροι από μένα").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("el.stopwords").predict("""Δεν είστε καλύτεροι από μένα""")
```
## Results
```bash
+-----------------+
|result |
+-----------------+
|[καλύτεροι, μένα]|
+-----------------+
```
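Conceptually, `StopWordsCleaner` simply drops any token found in its stopword list, as the result above shows. A minimal pure-Python sketch of the same filtering follows; `SAMPLE_STOPWORDS` is a tiny illustrative subset, not the model's full 663-entry Greek list.

```python
# Illustrative subset only; the pretrained model ships 663 Greek stopwords.
SAMPLE_STOPWORDS = {"δεν", "είστε", "από"}

def remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Keep only tokens whose lowercased form is not in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

print(remove_stopwords("Δεν είστε καλύτεροι από μένα".split()))
# ['καλύτεροι', 'μένα']
```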
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|el|
|Size:|3.8 KB|
---
layout: model
title: Translate Caucasian languages to English Pipeline
author: John Snow Labs
name: translate_cau_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, cau, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `cau`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_cau_en_xx_2.7.0_2.4_1609687516556.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_cau_en_xx_2.7.0_2.4_1609687516556.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_cau_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_cau_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.cau.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_cau_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Sentence Entity Resolver for ICD10-CM (Augmented)
author: John Snow Labs
name: sbiobertresolve_icd10cm_augmented
date: 2023-05-31
tags: [licensed, en, clinical, entity_resolution, icd10cm]
task: Entity Resolution
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps extracted medical entities to ICD-10-CM codes using `sbiobert_base_cased_mli` Sentence BERT embeddings. It has also been augmented with synonyms to improve accuracy.
## Predicted Entities
`ICD-10-CM Codes`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_4.4.2_3.0_1685503130827.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_4.4.2_3.0_1685503130827.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("word_embeddings")
ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token", "word_embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(["PROBLEM"])
c2doc = Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sentence_embeddings")\
.setCaseSensitive(False)
icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented", "en", "clinical/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
resolver_pipeline = Pipeline(stages = [document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
ner,
ner_converter,
c2doc,
sbert_embedder,
icd_resolver])
data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."""]]).toDF("text")
result = resolver_pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
.setWhiteList("PROBLEM")
val chunk2doc = new Chunk2Doc()
.setInputCols("ner_chunk")
.setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols("ner_chunk_doc")
.setOutputCol("sbert_embeddings")
val icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented", "en", "clinical/models")
.setInputCols("sbert_embeddings")
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
chunk2doc,
sbert_embedder,
icd10_resolver))
val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+
| ner_chunk| entity|icd10_code| resolutions| all_codes|
+-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+
| gestational diabetes mellitus|PROBLEM| O24.4|[gestational diabetes mellitus [gestational diabetes mellitus], gestatio...| [O24.4, O24.41, O24.43, Z86.32, Z87.5, O24.31, O24.11, O24.1, O24.81]|
|subsequent type two diabetes mellitus|PROBLEM| O24.11|[pre-existing type 2 diabetes mellitus [pre-existing type 2 diabetes mel...|[O24.11, E11.8, E11, E13.9, E11.9, E11.3, E11.44, Z86.3, Z86.39, E11.32,...|
| obesity|PROBLEM| E66.9|[obesity [obesity, unspecified], abdominal obesity [other obesity], obes...|[E66.9, E66.8, Z68.41, Q13.0, E66, E66.01, Z86.39, E34.9, H35.50, Z83.49...|
| a body mass index|PROBLEM| Z68.41|[finding of body mass index [body mass index [bmi] 40.0-44.9, adult], ob...|[Z68.41, E66.9, R22.9, Z68.1, R22.3, R22.1, Z68, R22.2, R22.0, R41.89, M...|
| polyuria|PROBLEM| R35|[polyuria [polyuria], nocturnal polyuria [nocturnal polyuria], polyuric ...|[R35, R35.81, R35.8, E23.2, R31, R35.0, R82.99, N40.1, E72.3, O04.8, R30...|
| polydipsia|PROBLEM| R63.1|[polydipsia [polydipsia], psychogenic polydipsia [other impulse disorder...|[R63.1, F63.89, E23.2, F63.9, O40, G47.5, M79.89, R63.2, R06.1, H53.8, I...|
| poor appetite|PROBLEM| R63.0|[poor appetite [anorexia], poor feeding [feeding problem of newborn, uns...|[R63.0, P92.9, R43.8, R43.2, E86, R19.6, F52.0, Z72.4, R06.89, Z76.89, R...|
| vomiting|PROBLEM| R11.1|[vomiting [vomiting], intermittent vomiting [nausea and vomiting], vomit...| [R11.1, R11, R11.10, G43.A1, P92.1, P92.09, G43.A, R11.13, R11.0]|
| a respiratory tract infection|PROBLEM| J98.8|[respiratory tract infection [other specified respiratory disorders], up...|[J98.8, J06.9, A49.9, J22, J20.9, Z59.3, T17, J04.10, Z13.83, J18.9, P28...|
+-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_icd10cm_augmented|
|Compatibility:|Healthcare NLP 4.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[icd10cm_code]|
|Language:|en|
|Size:|1.4 GB|
|Case sensitive:|false|
---
layout: model
title: Kannada RoBERTa Embeddings (from Chakita)
author: John Snow Labs
name: roberta_embeddings_KNUBert
date: 2022-04-14
tags: [roberta, embeddings, kn, open_source]
task: Embeddings
language: kn
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `KNUBert` is a Kannada model originally trained by `Chakita`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_KNUBert_kn_3.4.2_3.0_1649948307214.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_KNUBert_kn_3.4.2_3.0_1649948307214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_KNUBert","kn") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["ನಾನು ಸ್ಪಾರ್ಕ್ ಎನ್ಎಲ್ಪಿ ಪ್ರೀತಿಸುತ್ತೇನೆ"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_KNUBert","kn")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("ನಾನು ಸ್ಪಾರ್ಕ್ ಎನ್ಎಲ್ಪಿ ಪ್ರೀತಿಸುತ್ತೇನೆ").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("kn.embed.KNUBert").predict("""ನಾನು ಸ್ಪಾರ್ಕ್ ಎನ್ಎಲ್ಪಿ ಪ್ರೀತಿಸುತ್ತೇನೆ""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_KNUBert|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|kn|
|Size:|314.5 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Chakita/KNUBert
---
layout: model
title: Relation extraction between Drugs and ADE (ReDL)
author: John Snow Labs
name: redl_ade_biobert
date: 2023-01-14
tags: [relation_extraction, en, clinical, licensed, ade, biobert, tensorflow]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is an end-to-end trained BioBERT model that relates drugs to the adverse events they cause. It predicts whether an adverse event was caused by a drug: `1` means the adverse event and drug entities are related, `0` means they are not.
## Predicted Entities
`0`, `1`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ADE/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/RE_ADE.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_ade_biobert_en_4.2.4_3.0_1673708531142.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_ade_biobert_en_4.2.4_3.0_1673708531142.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
words_embedder = WordEmbeddingsModel() \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"]) \
.setOutputCol("embeddings")
ner_tagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_converter = NerConverterInternal() \
.setInputCols(["sentences", "tokens", "ner_tags"]) \
.setOutputCol("ner_chunks")
pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
dependency_parser = DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentences", "pos_tags", "tokens"])\
.setOutputCol("dependencies")
# Set a filter on pairs of named entities which will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter() \
.setInputCols(["ner_chunks", "dependencies"])\
.setMaxSyntacticDistance(10)\
.setOutputCol("re_ner_chunks")\
.setRelationPairs(['ade-drug', 'drug-ade'])
# This model was trained on sentence-level relation data.
# Such models can also be trained on document-level relations; in that case, use "document" instead of "sentences" as input when predicting.
re_model = RelationExtractionDLModel()\
.pretrained('redl_ade_biobert', 'en', "clinical/models") \
.setPredictionThreshold(0.5)\
.setInputCols(["re_ner_chunks", "sentences"]) \
.setOutputCol("relations")
pipeline = Pipeline(stages=[documenter,
sentencer,
tokenizer,
pos_tagger,
words_embedder,
ner_tagger,
ner_converter,
dependency_parser,
re_ner_chunk_filter,
re_model])
light_pipeline = LightPipeline(pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
text ="""Been taking Lipitor for 15 years , have experienced severe fatigue a lot. The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps."""
annotations = light_pipeline.fullAnnotate(text)
```
```scala
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val words_embedder = WordEmbeddingsModel()
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val ner_tagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val pos_tagger = PerceptronModel()
.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val dependency_parser = DependencyParserModel()
.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
// Set a filter on pairs of named entities which will be treated as relation candidates
val re_ner_chunk_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunks", "dependencies"))
.setMaxSyntacticDistance(10)
.setOutputCol("re_ner_chunks")
.setRelationPairs(Array("drug-ade", "ade-drug"))
// This model was trained on sentence-level relation data.
// Such models can also be trained on document-level relations; in that case, use "document" instead of "sentences" as input when predicting.
val re_model = RelationExtractionDLModel()
.pretrained("redl_ade_biobert", "en", "clinical/models")
.setPredictionThreshold(0.5)
.setInputCols(Array("re_ner_chunks", "sentences"))
.setOutputCol("relations")
val pipeline = new Pipeline().setStages(Array(documenter,
sentencer,
tokenizer,
words_embedder,
ner_tagger,
ner_converter,
pos_tagger,
dependency_parser,
re_ner_chunk_filter,
re_model))
val data = Seq("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot. The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation.ade").predict("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot. The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps.""")
```
## Results
```bash
| relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|-----------:|:----------|----------------:|--------------:|:----------|:----------|----------------:|--------------:|:---------------|-------------:|
| 1 | DRUG | 12 | 18 | Lipitor | ADE | 52 | 65 | severe fatigue | 0.998156 |
| 1 | DRUG | 97 | 105 | voltarene | ADE | 144 | 156 | muscle cramps | 0.985513 |
```
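The `entity*_begin`/`entity*_end` columns above are inclusive character offsets into the input text, so a chunk can be recovered by slicing up to `end + 1`. A small sketch verifying the offsets from the results table:

```python
text = ("Been taking Lipitor for 15 years , have experienced severe fatigue a lot. "
        "The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps.")

def chunk_at(text, begin, end):
    """Recover a chunk from inclusive begin/end character offsets."""
    return text[begin:end + 1]

print(chunk_at(text, 12, 18))    # Lipitor
print(chunk_at(text, 52, 65))    # severe fatigue
print(chunk_at(text, 97, 105))   # voltarene
print(chunk_at(text, 144, 156))  # muscle cramps
```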
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|redl_ade_biobert|
|Compatibility:|Healthcare NLP 4.2.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|401.7 MB|
## References
This model is trained on custom data annotated by JSL.
## Benchmarking
```bash
label Recall Precision F1 Support
0 0.829 0.895 0.861 1146
1 0.955 0.923 0.939 2454
Avg. 0.892 0.909 0.900 -
Weighted-Avg. 0.915 0.914 0.914 -
```
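The two averages in the table combine the per-label scores differently: the macro average is the unweighted mean of the per-label F1s, while the weighted average weights each label's F1 by its support. A sketch reproducing both from the per-label rows:

```python
# Per-label F1 and support, copied from the benchmark table above
f1 = {"0": 0.861, "1": 0.939}
support = {"0": 1146, "1": 2454}

macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[k] * support[k] for k in f1) / sum(support.values())

print(f"{macro_f1:.3f}")     # 0.900
print(f"{weighted_f1:.3f}")  # 0.914
```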
---
layout: model
title: English BertForQuestionAnswering model (from MrAnderson)
author: John Snow Labs
name: bert_qa_bert_base_512_full_trivia
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-512-full-trivia` is an English model originally trained by `MrAnderson`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_512_full_trivia_en_4.0.0_3.0_1654179670286.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_512_full_trivia_en_4.0.0_3.0_1654179670286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_512_full_trivia","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_512_full_trivia","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.trivia.bert.base_512d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
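The NLU one-liner above packs the question and the context into a single string separated by `|||`. Assuming that separator convention, splitting such a payload back into its two parts is a one-liner:

```python
def split_qa(payload, sep="|||"):
    """Split a packed 'question|||context' string into its two parts."""
    question, context = payload.split(sep, 1)
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```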
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_512_full_trivia|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/MrAnderson/bert-base-512-full-trivia
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223765535.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223765535.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|460.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AnonymousSub/rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0
---
layout: model
title: Extract Mentions of Response to Cancer Treatment
author: John Snow Labs
name: ner_oncology_response_to_treatment_wip
date: 2022-10-01
tags: [licensed, clinical, oncology, en, ner, treatment]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts entities related to the patient's response to oncology treatment, including clinical response and changes in tumor size.
Definitions of Predicted Entities:
- `Line_Of_Therapy`: Explicit references to the line of therapy of an oncological therapy (e.g. "first-line treatment").
- `Response_To_Treatment`: Terms related to clinical progress of the patient related to cancer treatment, including "recurrence", "bad response" or "improvement".
- `Size_Trend`: Terms related to the changes in the size of the tumor (such as "growth" or "reduced in size").
## Predicted Entities
`Line_Of_Therapy`, `Response_To_Treatment`, `Size_Trend`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_response_to_treatment_wip_en_4.0.0_3.0_1664585303681.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_response_to_treatment_wip_en_4.0.0_3.0_1664585303681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_response_to_treatment_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["She completed her first-line therapy, but some months later there was recurrence of the breast cancer. "]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_response_to_treatment_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("She completed her first-line therapy, but some months later there was recurrence of the breast cancer. ").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_response_to_treatment_wip").predict("""She completed her first-line therapy, but some months later there was recurrence of the breast cancer. """)
```
## Results
```bash
| chunk | ner_label |
|:-----------|:----------------------|
| first-line | Line_Of_Therapy |
| recurrence | Response_To_Treatment |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_oncology_response_to_treatment_wip|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|848.8 KB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Response_To_Treatment 233.0 81.0 120.0 353.0 0.74 0.66 0.70
Size_Trend 31.0 34.0 45.0 76.0 0.48 0.41 0.44
Line_Of_Therapy 82.0 11.0 5.0 87.0 0.88 0.94 0.91
macro_avg 346.0 126.0 170.0 516.0 0.70 0.67 0.68
micro_avg NaN NaN NaN NaN 0.73 0.67 0.70
```
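The micro-averaged row is computed by summing true positives, false positives, and false negatives across all labels before applying the precision/recall formulas. A sketch reproducing it from the `macro_avg` totals above:

```python
# tp/fp/fn summed over all labels, from the benchmark table above
tp, fp, fn = 346, 126, 170

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # 0.73 0.67 0.70
```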
---
layout: model
title: Recognize Entities OntoNotes pipeline - BERT Large
author: John Snow Labs
name: onto_recognize_entities_bert_large
date: 2021-03-23
tags: [open_source, english, onto_recognize_entities_bert_large, pipeline, en]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: en
nav_key: models
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The onto_recognize_entities_bert_large is a pretrained pipeline that processes text through basic preprocessing steps and recognizes entities. It performs most of the common text processing tasks on your dataframe.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_large_en_3.0.0_3.0_1616475201428.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_large_en_3.0.0_3.0_1616475201428.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('onto_recognize_entities_bert_large', lang = 'en')
annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("onto_recognize_entities_bert_large", lang = "en")
val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0)
```
{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs ! "]
result_df = nlu.load('en.ner.onto.bert.large').predict(text)
result_df
```
## Results
```bash
| | document | sentence | token | embeddings | ner | entities |
|---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:-------------------------------------------|:-------------------|
| 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[-0.262016534805297,.,...]] | ['O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O'] | ['John Snow Labs'] |
```
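In the results above, the `entities` column is derived by grouping the token-level BIO tags in `ner`: a `B-` tag opens a chunk and consecutive `I-` tags of the same entity extend it. A minimal sketch of that grouping logic:

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into entity surface strings."""
    entities, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                     # close any open chunk
                entities.append(" ".join(current))
            current = [tok]                 # start a new chunk
        elif tag.startswith("I-") and current:
            current.append(tok)             # extend the open chunk
        else:                               # "O" tag closes the chunk
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

tokens = ["Hello", "from", "John", "Snow", "Labs", "!"]
tags = ["O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
print(bio_to_entities(tokens, tags))  # ['John Snow Labs']
```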
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|onto_recognize_entities_bert_large|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from google)
author: John Snow Labs
name: t5_efficient_base_nh24
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-nh24` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nh24_en_4.3.0_3.0_1675113000788.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nh24_en_4.3.0_3.0_1675113000788.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_base_nh24","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_base_nh24","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_base_nh24|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|636.9 MB|
## References
- https://huggingface.co/google/t5-efficient-base-nh24
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: English BertForQuestionAnswering model (from armageddon)
author: John Snow Labs
name: bert_qa_bert_base_uncased_squad2_covid_qa_deepset
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad2-covid-qa-deepset` is an English model originally trained by `armageddon`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad2_covid_qa_deepset_en_4.0.0_3.0_1654181524986.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad2_covid_qa_deepset_en_4.0.0_3.0_1654181524986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squad2_covid_qa_deepset","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_uncased_squad2_covid_qa_deepset","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_covid.bert.base_uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_uncased_squad2_covid_qa_deepset|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/armageddon/bert-base-uncased-squad2-covid-qa-deepset
---
layout: model
title: Sentiment Analysis on texts about Airlines
author: John Snow Labs
name: distilbert_base_sequence_classifier_airlines
date: 2022-02-18
tags: [airlines, distilbert, sequence_classification, en, open_source]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 3.4.0
spark_version: 3.0
supported: true
annotator: DistilBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model was imported from `Hugging Face` ([link](https://huggingface.co/tasosk/distilbert-base-uncased-airlines)) and has been trained on the tasosk/airlines dataset, leveraging `DistilBERT` embeddings and `DistilBertForSequenceClassification` for text classification. The model classifies texts into two categories: `YES` for positive comments and `NO` for negative ones.
## Predicted Entities
`YES`, `NO`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_base_sequence_classifier_airlines_en_3.4.0_3.0_1645179643194.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_base_sequence_classifier_airlines_en_3.4.0_3.0_1645179643194.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
sequenceClassifier = DistilBertForSequenceClassification\
.pretrained('distilbert_base_sequence_classifier_airlines', 'en') \
.setInputCols(['token', 'document']) \
.setOutputCol('class')
pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])
example = spark.createDataFrame([["Jersey to London Gatwick with easyJet and another great flight. Due to the flight time, airport check-in was not open, however I'd checked in a few days before with the easyJet app which was very quick and convenient. Boarding was quick and we left a few minutes early, which is a bonus. The cabin crew were friendly and the aircraft was clean and comfortable. We arrived at Gatwick 5-10 minutes early, and disembarking was as quick as boarding. On the way back, we were about half an hour early landing, which was fantastic. For the short flight from JER-LGW, easyJet are ideal and a bit better than British Airways in my opinion, and the fares are just unmissable. Both flights for two adults cost £180. easyJet can expect my business in the near future."]]).toDF("text")
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val sequenceClassifier = DistilBertForSequenceClassification.pretrained("distilbert_base_sequence_classifier_airlines", "en")
.setInputCols(Array("document", "token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))
val example = Seq("Jersey to London Gatwick with easyJet and another great flight. Due to the flight time, airport check-in was not open, however I'd checked in a few days before with the easyJet app which was very quick and convenient. Boarding was quick and we left a few minutes early, which is a bonus. The cabin crew were friendly and the aircraft was clean and comfortable. We arrived at Gatwick 5-10 minutes early, and disembarking was as quick as boarding. On the way back, we were about half an hour early landing, which was fantastic. For the short flight from JER-LGW, easyJet are ideal and a bit better than British Airways in my opinion, and the fares are just unmissable. Both flights for two adults cost £180. easyJet can expect my business in the near future.").toDF("text")
val result = pipeline.fit(example).transform(example)
```
## Results
```bash
['YES']
```
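Conceptually, the classification head behind this result reduces to a softmax over two logits followed by an argmax into the label set. A toy sketch (the logits are invented; real scores come from the DistilBERT head):

```python
import math

labels = ["NO", "YES"]     # the model's two classes
logits = [-1.2, 2.3]       # hypothetical head outputs for one review

exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]
pred = labels[probs.index(max(probs))]
print(pred, round(max(probs), 3))  # YES, with high confidence
```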
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_base_sequence_classifier_airlines|
|Compatibility:|Spark NLP 3.4.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|249.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
[https://huggingface.co/datasets/tasosk/airlines](https://huggingface.co/datasets/tasosk/airlines)
## Benchmarking
```bash
label score
accuracy 0.9288
f1 0.9289
```
---
layout: model
title: Pipeline to Detect Chemicals in Medical Texts
author: John Snow Labs
name: bert_token_classifier_ner_chemicals_pipeline
date: 2022-03-14
tags: [chemicals, bert_token_classifier, pipeline, ner, en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [bert_token_classifier_ner_chemicals](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_chemicals_en.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CHEMICALS/){:.button.button-orange}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemicals_pipeline_en_3.4.1_3.0_1647256416720.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemicals_pipeline_en_3.4.1_3.0_1647256416720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
chemicals_pipeline = PretrainedPipeline("bert_token_classifier_ner_chemicals_pipeline", "en", "clinical/models")
chemicals_pipeline.annotate("""The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.""")
```
```scala
val chemicals_pipeline = new PretrainedPipeline("bert_token_classifier_ner_chemicals_pipeline", "en", "clinical/models")
chemicals_pipeline.annotate("The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.chemicals_pipeline").predict("""The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.""")
```
## Results
```bash
+---------------------------+---------+
|chunk |ner_label|
+---------------------------+---------+
|p - choloroaniline |CHEM |
|chlorhexidine - digluconate|CHEM |
|kanamycin |CHEM |
|colistin |CHEM |
|povidone - iodine |CHEM |
+---------------------------+---------+
```
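The `NerConverter` stage at the end of this pipeline is what groups token-level B-/I- tags into the chunks shown above. A minimal pure-Python sketch of that grouping (the tags below are illustrative, chosen to reproduce three of the chunks in the table):

```python
def bio_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["kanamycin", "-", "colistin", "and", "povidone", "-", "iodine"]
tags = ["B-CHEM", "O", "B-CHEM", "O", "B-CHEM", "I-CHEM", "I-CHEM"]
print(bio_to_chunks(tokens, tags))
# [('kanamycin', 'CHEM'), ('colistin', 'CHEM'), ('povidone - iodine', 'CHEM')]
```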
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_chemicals_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.3 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverter
- Finisher
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from botika)
author: John Snow Labs
name: distilbert_qa_botika_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `botika`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_botika_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770249638.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_botika_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770249638.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_botika_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_botika_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_botika_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/botika/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from graviraja)
author: John Snow Labs
name: distilbert_qa_graviraja_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `graviraja`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_graviraja_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770884217.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_graviraja_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770884217.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_graviraja_base_uncased_finetuned_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_graviraja_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_graviraja_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/graviraja/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Extract Demographic Entities from Voice of the Patient Documents (embeddings_clinical_large)
author: John Snow Labs
name: ner_vop_demographic_emb_clinical_large
date: 2023-06-06
tags: [licensed, clinical, ner, en, vop, demographic]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts demographic entities from health-related text written in the patient's own words.
## Predicted Entities
`Gender`, `Employment`, `RaceEthnicity`, `Age`, `Substance`, `RelationshipStatus`, `SubstanceQuantity`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_demographic_emb_clinical_large_en_4.4.3_3.0_1686075195884.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_demographic_emb_clinical_large_en_4.4.3_3.0_1686075195884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_vop_demographic_emb_clinical_large", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["My grandma, who's 85 and Black, just had a pacemaker implanted in the cardiology department. The doctors say it'll help regulate her heartbeat and prevent future complications."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_vop_demographic_emb_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("My grandma, who's 85 and Black, just had a pacemaker implanted in the cardiology department. The doctors say it'll help regulate her heartbeat and prevent future complications.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| chunk | ner_label |
|:---------|:--------------|
| grandma | Gender |
| who's 85 | Age |
| Black | RaceEthnicity |
| doctors | Employment |
| her | Gender |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_vop_demographic_emb_clinical_large|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.8 MB|
|Dependencies:|embeddings_clinical_large|
## References
In-house annotated health-related text in colloquial language.
## Sample text from the training dataset
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
## Benchmarking
```bash
label tp fp fn total precision recall f1
Gender 1298 21 19 1317 0.98 0.99 0.98
Employment 1180 50 63 1243 0.96 0.95 0.95
RaceEthnicity 31 2 2 33 0.94 0.94 0.94
Age 549 45 33 582 0.92 0.94 0.93
Substance 391 56 30 421 0.87 0.93 0.90
RelationshipStatus 18 3 6 24 0.86 0.75 0.80
SubstanceQuantity 61 14 24 85 0.81 0.72 0.76
macro_avg 3528 191 177 3705 0.91 0.89 0.89
micro_avg 3528 191 177 3705 0.95 0.95 0.95
```
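The precision, recall, and F1 figures above follow directly from the tp/fp/fn counts in each row. A quick sketch reproducing the Gender row:

```python
def prf(tp, fp, fn):
    """Per-label precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# Gender row from the table: tp=1298, fp=21, fn=19
print(prf(1298, 21, 19))  # (0.98, 0.99, 0.98)
```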
---
layout: model
title: Sentence Entity Resolver for Snomed Concepts, CT version (``sbiobert_base_cased_mli`` embeddings)
author: John Snow Labs
name: sbiobertresolve_snomed_findings
language: en
nav_key: models
repository: clinical/models
date: 2020-11-27
task: Entity Resolution
edition: Healthcare NLP 2.6.4
spark_version: 2.4
tags: [clinical,entity_resolution,en]
supported: true
annotator: SentenceEntityResolverModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model maps extracted medical entities to Snomed codes (CT version) using chunk embeddings.
{:.h2_title}
## Predicted Entities
SNOMED codes and their normalized definitions, resolved with ``sbiobert_base_cased_mli`` embeddings.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_findings_en_2.6.4_2.4_1606235762315.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_findings_en_2.6.4_2.4_1606235762315.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
The ```sbiobertresolve_snomed_findings``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_clinical``` as the NER model. There is no need to set ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
snomed_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_findings","en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_resolver])
data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
```
```scala
...
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
val snomed_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_snomed_findings","en", "clinical/models")
.setInputCols(Array("ner_chunk", "sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_resolver))
val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
```bash
+--------------------+-----+---+---------+---------+----------+--------------------+--------------------+
| chunk|begin|end| entity| code|confidence| resolutions| codes|
+--------------------+-----+---+---------+---------+----------+--------------------+--------------------+
| hypertension| 68| 79| PROBLEM| 38341003| 0.3234|hypertension:::hy...|38341003:::155295...|
|chronic renal ins...| 83|109| PROBLEM|723190009| 0.7522|chronic renal ins...|723190009:::70904...|
| COPD| 113|116| PROBLEM| 13645005| 0.1226|copd - chronic ob...|13645005:::155565...|
| gastritis| 120|128| PROBLEM|235653009| 0.2444|gastritis:::gastr...|235653009:::45560...|
| TIA| 136|138| PROBLEM|275382005| 0.0766|cerebral trauma (...|275382005:::44739...|
|a non-ST elevatio...| 182|202| PROBLEM|233843008| 0.2224|silent myocardial...|233843008:::19479...|
|Guaiac positive s...| 208|229| PROBLEM| 59614000| 0.9678|guaiac-positive s...|59614000:::703960...|
|cardiac catheteri...| 295|317| TEST|301095005| 0.2584|cardiac finding::...|301095005:::25090...|
| PTCA| 324|327|TREATMENT|373108000| 0.0809|post percutaneous...|373108000:::25103...|
| mid LAD lesion| 332|345| PROBLEM|449567000| 0.0900|overriding left v...|449567000:::46140...|
+--------------------+-----+---+---------+---------+----------+--------------------+--------------------+
```
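At resolution time the model embeds each NER chunk and returns the SNOMED entries whose precomputed sentence embeddings are nearest under the configured distance (`EUCLIDEAN` here), which is where the ranked `resolutions` and `codes` columns above come from. A toy dependency-free sketch with made-up 3-dimensional embeddings and a tiny hypothetical index:

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical index: SNOMED code -> (normalized definition, embedding)
index = {
    "38341003": ("hypertension", [0.9, 0.1, 0.0]),
    "13645005": ("copd - chronic obstructive pulmonary disease", [0.1, 0.8, 0.3]),
    "235653009": ("gastritis", [0.0, 0.2, 0.9]),
}

def resolve(chunk_embedding, k=2):
    """Return the k nearest (code, definition) pairs for a chunk embedding."""
    ranked = sorted(index.items(), key=lambda kv: euclidean(chunk_embedding, kv[1][1]))
    return [(code, definition) for code, (definition, _) in ranked[:k]]

print(resolve([0.85, 0.15, 0.05]))  # nearest entry: hypertension
```

The real resolver searches hundreds of thousands of SNOMED entries and reports distance-based confidence alongside the ranked codes.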
{:.model-param}
## Model Information
{:.table-model}
|---------------|---------------------|
| Name: | sbiobertresolve_snomed_findings |
| Type: | SentenceEntityResolverModel |
| Compatibility: | Spark NLP 2.6.4+ |
| License: | Licensed |
| Edition: | Official |
|Input labels: | [ner_chunk, chunk_embeddings] |
|Output labels: | [resolution] |
| Language: | en |
| Dependencies: | sbiobert_base_cased_mli |
{:.h2_title}
## Data Source
Trained on SNOMED (CT version) Findings with ``sbiobert_base_cased_mli`` sentence embeddings.
http://www.snomed.org/
---
layout: model
title: Fast Neural Machine Translation Model from Swedish to English
author: John Snow Labs
name: opus_mt_sv_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, sv, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `sv`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_sv_en_xx_2.7.0_2.4_1609170150107.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_sv_en_xx_2.7.0_2.4_1609170150107.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_sv_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["Jag bor i Stockholm."]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_sv_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Jag bor i Stockholm.").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.sv.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_sv_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForQuestionAnswering model (from madlag)
author: John Snow Labs
name: bert_qa_bert_large_uncased_wwm_squadv2_x2.15_f83.2_d25_hybrid_v1
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-wwm-squadv2-x2.15-f83.2-d25-hybrid-v1` is an English model originally trained by `madlag`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_wwm_squadv2_x2.15_f83.2_d25_hybrid_v1_en_4.0.0_3.0_1654537627313.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_wwm_squadv2_x2.15_f83.2_d25_hybrid_v1_en_4.0.0_3.0_1654537627313.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_wwm_squadv2_x2.15_f83.2_d25_hybrid_v1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_large_uncased_wwm_squadv2_x2.15_f83.2_d25_hybrid_v1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.bert.large_uncased_v2_x2.15_f83.2_d25_hybrid.by_madlag").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
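Under the hood, extractive QA models like this one score every candidate start and end position in the context and return the best-scoring valid span. A minimal, framework-free sketch of that span-selection step (the tokens and logit values below are made up for illustration):

```python
def best_span(start_logits, end_logits, max_len=30):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e], with s <= e < s + max_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy context tokens and logits (illustrative only)
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end_logits   = [0.1, 0.1, 0.1, 4.5, 0.2, 0.1, 0.1, 0.1, 0.4, 0.1]
s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # highest-scoring span -> "Clara"
```

The real annotator also handles subword tokenization and masking of the question tokens, but the span argmax shown here is the core of how the `answer` column is produced.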
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_large_uncased_wwm_squadv2_x2.15_f83.2_d25_hybrid_v1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|455.4 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/madlag/bert-large-uncased-wwm-squadv2-x2.15-f83.2-d25-hybrid-v1
- https://rajpurkar.github.io/SQuAD-explorer
- https://www.aclweb.org/anthology/N19-1423.pdf
---
layout: model
title: Detect PHI for Deidentification purposes (Spanish, reduced entities, augmented data)
author: John Snow Labs
name: ner_deid_generic_augmented
date: 2022-02-16
tags: [deid, es, licensed]
task: De-identification
language: es
edition: Healthcare NLP 3.3.4
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs".
Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 8 entities (1 more than the `ner_deid_generic` NER model).
This NER model is trained on a combination of custom datasets, the Spanish CoNLL 2002 corpus, the MeddoProf dataset, and several data augmentation mechanisms, and has been augmented with the MEDDOCAN Spanish deidentification corpus (unlike `ner_deid_generic`, which does not include it). It is a generalized version of `ner_deid_subentity_augmented`.
## Predicted Entities
`CONTACT`, `NAME`, `DATE`, `ID`, `LOCATION`, `PROFESSION`, `AGE`, `SEX`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_es_3.3.4_3.0_1645006125653.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_es_3.3.4_3.0_1645006125653.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")\
.setInputCols(["sentence","token"])\
.setOutputCol("word_embeddings")
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "es", "clinical/models")\
.setInputCols(["sentence","token","word_embeddings"])\
.setOutputCol("ner")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner])
text = ['''
Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.
''']
df = spark.createDataFrame([text]).toDF("text")
results = nlpPipeline.fit(df).transform(df)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("word_embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "es", "clinical/models")
.setInputCols(Array("sentence","token","word_embeddings"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner))
val text = "Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos."
val df = Seq(text).toDF("text")
val results = pipeline.fit(df).transform(df)
```
{:.nlu-block}
```python
import nlu
nlu.load("es.med_ner.deid.generic_augmented").predict("""
Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.
""")
```
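The pipeline above stops at the raw token-level `ner` column; production deidentification pipelines typically append an `NerConverter` stage that merges BIO tags into entity chunks. A simplified, Spark-free sketch of that merging logic (tokens and tags below are illustrative, loosely following the example sentence):

```python
def bio_to_chunks(tokens, tags):
    """Merge token-level BIO tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" tag (or orphan "I-") closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Antonio", "Miguel", "Martínez", ",", "un", "varón", "de", "35", "años"]
tags   = ["B-NAME", "I-NAME", "I-NAME", "O", "O", "B-SEX", "O", "B-AGE", "O"]
print(bio_to_chunks(tokens, tags))
# [('Antonio Miguel Martínez', 'NAME'), ('varón', 'SEX'), ('35', 'AGE')]
```

This is why multi-token names such as "Antonio Miguel Martínez" surface as a single `NAME` chunk rather than three separate predictions.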
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom3","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.custom3.by_aszidon").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
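The nlu one-liner above packs the question and context into a single string separated by `|||`. A tiny sketch of that convention (the separator follows the snippet above; the helper name is hypothetical):

```python
def split_qa(payload, sep="|||"):
    """Split a 'question|||context' string into its two parts."""
    question, context = payload.split(sep, 1)
    return question.strip(), context.strip()

q, c = split_qa("What is my name?|||My name is Clara and I live in Berkeley.")
print(q)  # the question half
print(c)  # the context half
```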
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_custom3|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/aszidon/distilbertcustom3
---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from claytonsamples)
author: John Snow Labs
name: xlmroberta_ner_claytonsamples_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `claytonsamples`.
## Predicted Entities
`PER`, `LOC`, `ORG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_claytonsamples_base_finetuned_panx_de_4.1.0_3.0_1660431662923.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_claytonsamples_base_finetuned_panx_de_4.1.0_3.0_1660431662923.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_claytonsamples_base_finetuned_panx","de") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_claytonsamples_base_finetuned_panx","de")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_claytonsamples_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/claytonsamples/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme
---
layout: model
title: Thai BertForQuestionAnswering model (from airesearch)
author: John Snow Labs
name: bert_qa_bert_base_multilingual_cased_finetune_qa
date: 2022-06-02
tags: [th, open_source, question_answering, bert]
task: Question Answering
language: th
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetune-qa` is a Thai model originally trained by `airesearch`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetune_qa_th_4.0.0_3.0_1654179974020.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetune_qa_th_4.0.0_3.0_1654179974020.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_multilingual_cased_finetune_qa","th") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_base_multilingual_cased_finetune_qa","th")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("th.answer_question.bert.multilingual_base_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_multilingual_cased_finetune_qa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|th|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/airesearch/bert-base-multilingual-cased-finetune-qa
- https://github.com/vistec-AI/thai2transformers/blob/dev/scripts/downstream/train_question_answering_lm_finetuning.py
- https://wandb.ai/cstorm125/wangchanberta-qa
---
layout: model
title: Stopwords Remover for French language (507 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, fr, open_source]
task: Stop Words Removal
language: fr
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_fr_3.4.1_3.0_1646673106300.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_fr_3.4.1_3.0_1646673106300.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stop_words = StopWordsCleaner.pretrained("stopwords_iso","fr") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])
example = spark.createDataFrame([["Tu n'es pas mieux que moi"]], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val stop_words = StopWordsCleaner.pretrained("stopwords_iso","fr")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))
val data = Seq("Tu n'es pas mieux que moi").toDF("text")
val results = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("fr.stopwords").predict("""Tu n'es pas mieux que moi""")
```
## Results
```bash
+-------------+
|result |
+-------------+
|[n'es, mieux]|
+-------------+
```
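The result above is produced by matching each token against the 507-entry French stopword list and keeping only the non-matches. A minimal sketch of that filtering, assuming a tiny illustrative subset of the real list:

```python
# A tiny, illustrative subset of the 507-entry French stopword list (assumed here)
french_stopwords = {"tu", "pas", "que", "moi"}

def remove_stopwords(text, stopwords):
    """Whitespace-tokenize and drop tokens whose lowercase form is a stopword."""
    return [tok for tok in text.split() if tok.lower() not in stopwords]

print(remove_stopwords("Tu n'es pas mieux que moi", french_stopwords))
# ["n'es", 'mieux']
```

Note that `n'es` survives because the tokenizer here does not split the apostrophe contraction; the Spark NLP `Tokenizer` behaves similarly with its default rules.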
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|fr|
|Size:|2.9 KB|
---
layout: model
title: ALBERT Embeddings (XLarge Uncased)
author: John Snow Labs
name: albert_xlarge_uncased
date: 2020-04-28
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [embeddings, en, open_source]
supported: true
annotator: AlBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
ALBERT is "A Lite" version of BERT, a popular unsupervised language representation learning algorithm. ALBERT uses parameter-reduction techniques that allow for large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation. The details are described in the paper "[ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.](https://arxiv.org/abs/1909.11942)"
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_xlarge_uncased_en_2.5.0_2.4_1588073443653.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_xlarge_uncased_en_2.5.0_2.4_1588073443653.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = AlbertEmbeddings.pretrained("albert_xlarge_uncased", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
...
val embeddings = AlbertEmbeddings.pretrained("albert_xlarge_uncased", "en")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.albert.xlarge_uncased').predict(text, output_level='token')
embeddings_df
```
{:.h2_title}
## Results
```bash
token en_embed_albert_xlarge_uncased_embeddings
I [-0.4735468626022339, -0.03991951420903206, -1...
love [-0.4254034459590912, -0.371383935213089, -0.3...
NLP [0.7200506329536438, -0.12543179094791412, -0....
```
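A common downstream use of these 2048-dimensional token vectors is measuring semantic similarity. A minimal cosine-similarity sketch (the short vectors below are illustrative stand-ins for the real 2048-d ALBERT embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Illustrative 4-d stand-ins for the 2048-d vectors shown above
love = [-0.42, -0.37, -0.33, 0.10]
nlp  = [0.72, -0.12, -0.05, 0.30]
print(round(cosine(love, nlp), 3))
```

Values range from -1 (opposite) to 1 (identical direction); tokens used in similar contexts tend to score higher against each other.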
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|albert_xlarge_uncased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.5.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|en|
|Dimension:|2048|
|Case sensitive:|false|
{:.h2_title}
## Data Source
The model is imported from [https://tfhub.dev/google/albert_xlarge/3](https://tfhub.dev/google/albert_xlarge/3)
---
layout: model
title: Polish T5ForConditionalGeneration Base Cased model (from azwierzc)
author: John Snow Labs
name: t5_plt5_base_poquad
date: 2023-01-30
tags: [pl, open_source, t5, tensorflow]
task: Text Generation
language: pl
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `plt5-base-poquad` is a Polish model originally trained by `azwierzc`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_plt5_base_poquad_pl_4.3.0_3.0_1675106743524.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_plt5_base_poquad_pl_4.3.0_3.0_1675106743524.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_plt5_base_poquad","pl") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_plt5_base_poquad","pl")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_plt5_base_poquad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|pl|
|Size:|1.1 GB|
## References
- https://huggingface.co/azwierzc/plt5-base-poquad
---
layout: model
title: Legal NER for MAPA (Multilingual Anonymisation for Public Administrations)
author: John Snow Labs
name: legner_mapa
date: 2023-04-28
tags: [cs, licensed, legal, ner, mapa]
task: Named Entity Recognition
language: cs
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union.
This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Czech` documents.
## Predicted Entities
`ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_cs_1.0.0_3.0_1682668776380.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_cs_1.0.0_3.0_1682668776380.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_czech_legal","cs")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)\
.setCaseSensitive(True)
ner_model = legal.NerModel.pretrained("legner_mapa", "cs", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""V roce 2007 uzavřela společnost Alpenrind, dříve S GmbH, se společností Martin-Meat usazenou v Maďarsku smlouvu, podle níž se posledně uvedená společnost zavázala k porcování masa a jeho balení v rozsahu 25 půlek jatečně upravených těl skotu týdně."""]
result = model.transform(spark.createDataFrame([text]).toDF("text"))
```
## Results
```bash
+-----------+------------+
|chunk |ner_label |
+-----------+------------+
|2007 |DATE |
|Alpenrind |ORGANISATION|
|Martin-Meat|ORGANISATION|
|Maďarsku |ADDRESS |
|25 půlek |AMOUNT |
+-----------+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legner_mapa|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|cs|
|Size:|1.4 MB|
## References
The dataset is available [here](https://huggingface.co/datasets/joelito/mapa).
## Benchmarking
```bash
label precision recall f1-score support
ADDRESS 0.80 0.67 0.73 36
AMOUNT 1.00 1.00 1.00 5
DATE 0.98 0.98 0.98 56
ORGANISATION 0.64 0.66 0.65 32
PERSON 0.75 0.82 0.78 66
micro-avg 0.81 0.82 0.81 195
macro-avg 0.83 0.82 0.83 195
weighted-avg 0.81 0.82 0.81 195
```
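In the benchmark above, the micro average pools true/false positives across all labels before computing precision and recall, while the macro average computes per-label scores first and then averages them. A sketch of the micro-averaged computation from per-label counts (the counts below are hypothetical, not the actual evaluation counts):

```python
def micro_f1(counts):
    """Micro-averaged precision, recall, F1 from per-label (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical per-label counts: {label: (tp, fp, fn)}
counts = {"DATE": (55, 1, 1), "PERSON": (54, 18, 12)}
p, r, f1 = micro_f1(counts)
print(round(p, 2), round(r, 2), round(f1, 2))
```

Because pooling weights each prediction equally, frequent labels like `PERSON` dominate the micro average, which is why it can diverge from the macro average when label supports are imbalanced.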
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from shmuelamar)
author: John Snow Labs
name: roberta_qa_re
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `REQA-RoBERTa` is an English model originally trained by `shmuelamar`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_re_en_4.3.0_3.0_1674208450623.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_re_en_4.3.0_3.0_1674208450623.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_re","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_re","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_re|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/shmuelamar/REQA-RoBERTa
---
layout: model
title: Detect Clinical Entities in Romanian (w2v_cc_300d)
author: John Snow Labs
name: ner_clinical
date: 2022-07-01
tags: [clinical, ro, ner, w2v, licensed]
task: Named Entity Recognition
language: ro
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Extract clinical entities from Romanian clinical texts. This model is trained using Romanian `w2v_cc_300d` embeddings.
## Predicted Entities
`Measurements`, `Form`, `Symptom`, `Route`, `Procedure`, `Disease_Syndrome_Disorder`, `Score`, `Drug_Ingredient`, `Pulse`, `Frequency`, `Date`, `Body_Part`, `Drug_Brand_Name`, `Time`, `Direction`, `Dosage`, `Medical_Device`, `Imaging_Technique`, `Test`, `Imaging_Findings`, `Imaging_Test`, `Test_Result`, `Weight`, `Clinical_Dept`, `Units`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_ro_4.0.0_3.0_1656687302322.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_ro_4.0.0_3.0_1656687302322.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "ro") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "ro", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter])
sample_text = """ Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min."""
data = spark.createDataFrame([[sample_text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ro")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "ro","clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter))
val data = Seq("""Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("ro.med_ner.clinical").predict(""" Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""")
```
## Results
```bash
+--------------------------+-------------------------+
|chunks |entities |
+--------------------------+-------------------------+
|Angio CT |Imaging_Test |
|cardio-toracic |Body_Part |
|Atrezie |Disease_Syndrome_Disorder|
|valva pulmonara |Body_Part |
|Hipoplazie |Disease_Syndrome_Disorder|
|VS |Body_Part |
|Atrezie |Disease_Syndrome_Disorder|
|VAV stang |Body_Part |
|Anastomoza Glenn |Disease_Syndrome_Disorder|
|Sp |Body_Part |
|Tromboza |Disease_Syndrome_Disorder|
|Sectia Clinica Cardiologie|Clinical_Dept |
|GE Revolution HD |Medical_Device |
|Branula albastra |Medical_Device |
|membrului superior |Body_Part |
|drept |Direction |
|30 ml |Dosage |
|Iomeron 350 |Drug_Ingredient |
|2.2 ml/s |Dosage |
|20 ml |Dosage |
+--------------------------+-------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_clinical|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ro|
|Size:|15.0 MB|
## Benchmarking
```bash
label precision recall f1-score support
Body_Part 0.87 0.90 0.88 689
Clinical_Dept 0.68 0.62 0.65 97
Date 1.00 0.99 0.99 87
Direction 0.64 0.74 0.69 50
Disease_Syndrome_Disorder 0.69 0.66 0.67 123
Dosage 0.74 0.97 0.84 38
Drug_Ingredient 0.98 0.92 0.95 48
Form 1.00 1.00 1.00 6
Imaging_Findings 0.74 0.76 0.75 202
Imaging_Technique 0.92 0.88 0.90 26
Imaging_Test 0.93 0.97 0.95 208
Measurements 0.70 0.67 0.69 214
Medical_Device 0.92 0.81 0.86 42
Pulse 0.82 1.00 0.90 9
Route 0.97 0.91 0.94 33
Score 0.91 0.95 0.93 41
Time 1.00 1.00 1.00 28
Units 0.60 0.89 0.71 88
Weight 1.00 1.00 1.00 9
micro-avg 0.82 0.84 0.83 2054
macro-avg 0.70 0.72 0.71 2054
weighted-avg 0.81 0.84 0.82 2054
```
---
layout: model
title: Part of Speech for Norwegian
author: John Snow Labs
name: pos_ud_bokmaal
date: 2022-01-11
tags: [pos, norwegian, nb, open_source]
task: Part of Speech Tagging
language: nb
edition: Spark NLP 3.4.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
This model was trained using the dataset available at https://universaldependencies.org
{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_bokmaal_nb_3.4.0_3.0_1641902661339.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_bokmaal_nb_3.4.0_3.0_1641902661339.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pos = PerceptronModel.pretrained("pos_ud_bokmaal", "nb") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene.")
```
```scala
val pos = PerceptronModel.pretrained("pos_ud_bokmaal", "nb")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene."""]
pos_df = nlu.load('nb.pos.ud_bokmaal').predict(text)
pos_df
```
## Results
```bash
[Row(annotatorType='pos', begin=0, end=4, result='DET', metadata={'word': 'Annet'}),
Row(annotatorType='pos', begin=6, end=8, result='SCONJ', metadata={'word': 'enn'}),
Row(annotatorType='pos', begin=10, end=10, result='PART', metadata={'word': 'å'}),
Row(annotatorType='pos', begin=12, end=15, result='AUX', metadata={'word': 'være'}),
Row(annotatorType='pos', begin=17, end=22, result='NOUN', metadata={'word': 'kongen'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pos_ud_bokmaal|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.4.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|nb|
|Size:|17.7 KB|
## Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- PerceptronModel
---
layout: model
title: French CamemBert Embeddings (from ppletscher)
author: John Snow Labs
name: camembert_embeddings_dummy
date: 2022-05-23
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy` is a French model originally trained by `ppletscher`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_dummy_fr_3.4.4_3.0_1653321214351.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_dummy_fr_3.4.4_3.0_1653321214351.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_dummy","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_dummy","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_dummy|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/ppletscher/dummy
---
layout: model
title: SDOH Under Treatment For Classification
author: John Snow Labs
name: genericclassifier_sdoh_under_treatment_sbiobert_cased_mli
date: 2023-04-27
tags: [en, licensed, clinical, sdoh, generic_classifier, under_treatment, biobert]
task: Text Classification
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.0
supported: true
annotator: GenericClassifierModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This Generic Classifier model detects whether the patient is under treatment. If treatment status is not mentioned in the text, the case is regarded as "not under treatment". The model was trained with the GenericClassifierApproach annotator.
`Under_Treatment`: The patient is under treatment.
`Not_Under_Treatment_Or_Not_Mentioned`: The patient is not under treatment or it is not mentioned in the clinical notes.
## Predicted Entities
`Under_Treatment`, `Not_Under_Treatment_Or_Not_Mentioned`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_under_treatment_sbiobert_cased_mli_en_4.3.2_3.0_1682608513576.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_under_treatment_sbiobert_cased_mli_en_4.3.2_3.0_1682608513576.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
features_asm = FeaturesAssembler()\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("features")
generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_under_treatment_sbiobert_cased_mli", 'en', 'clinical/models')\
.setInputCols(["features"])\
.setOutputCol("prediction")
pipeline = Pipeline(stages=[
document_assembler,
sentence_embeddings,
features_asm,
generic_classifier
])
text_list = ["""Sarah, a 55-year-old woman with a history of high cholesterol and a family history of heart disease, presented to her primary care physician with complaints of chest pain and shortness of breath. After a thorough evaluation, Sarah was diagnosed with coronary artery disease (CAD), a condition that can lead to heart attacks and other serious complications.
To manage her CAD, Sarah was started on a treatment plan that included medication to lower her cholesterol and blood pressure, as well as aspirin to prevent blood clots. In addition to medication, Sarah was advised to make lifestyle modifications such as improving her diet, quitting smoking, and increasing physical activity.
Over the course of several months, Sarah's symptoms improved, and follow-up tests showed that her cholesterol and blood pressure were within the target range. However, Sarah continued to experience occasional chest pain, and her medication regimen was adjusted accordingly.
With regular follow-up appointments and adherence to her treatment plan, Sarah's CAD remained under control, and she was able to resume her normal activities with improved quality of life.
""",
"""John, a 60-year-old man with a history of smoking and high blood pressure, presented to his primary care physician with complaints of chest pain and shortness of breath. Further tests revealed that John had a blockage in one of his coronary arteries, which required urgent intervention. However, John was hesitant to undergo treatment, citing concerns about potential complications and side effects of medications and procedures.
Despite the physician's recommendations and attempts to educate John about the risks of leaving the blockage untreated, John ultimately chose not to pursue any treatment. Over the next several months, John continued to experience symptoms, which progressively worsened, and he ultimately required hospitalization for a heart attack. The medical team attempted to intervene at that point, but the damage to John's heart was severe, and his prognosis was poor.
"""]
df = spark.createDataFrame(text_list, StringType()).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("text", "prediction.result").show(truncate=100)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val features_asm = new FeaturesAssembler()
.setInputCols("sentence_embeddings")
.setOutputCol("features")
val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_under_treatment_sbiobert_cased_mli", "en", "clinical/models")
.setInputCols("features")
.setOutputCol("prediction")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_embeddings,
features_asm,
generic_classifier))
val data = Seq("""Sarah, a 55-year-old woman with a history of high cholesterol and a family history of heart disease, presented to her primary care physician with complaints of chest pain and shortness of breath. After a thorough evaluation, Sarah was diagnosed with coronary artery disease (CAD), a condition that can lead to heart attacks and other serious complications.
To manage her CAD, Sarah was started on a treatment plan that included medication to lower her cholesterol and blood pressure, as well as aspirin to prevent blood clots. In addition to medication, Sarah was advised to make lifestyle modifications such as improving her diet, quitting smoking, and increasing physical activity.
Over the course of several months, Sarah's symptoms improved, and follow-up tests showed that her cholesterol and blood pressure were within the target range. However, Sarah continued to experience occasional chest pain, and her medication regimen was adjusted accordingly.
With regular follow-up appointments and adherence to her treatment plan, Sarah's CAD remained under control, and she was able to resume her normal activities with improved quality of life.
""",
"""John, a 60-year-old man with a history of smoking and high blood pressure, presented to his primary care physician with complaints of chest pain and shortness of breath. Further tests revealed that John had a blockage in one of his coronary arteries, which required urgent intervention. However, John was hesitant to undergo treatment, citing concerns about potential complications and side effects of medications and procedures.
Despite the physician's recommendations and attempts to educate John about the risks of leaving the blockage untreated, John ultimately chose not to pursue any treatment. Over the next several months, John continued to experience symptoms, which progressively worsened, and he ultimately required hospitalization for a heart attack. The medical team attempted to intervene at that point, but the damage to John's heart was severe, and his prognosis was poor.
""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
## Results
```bash
+----------------------------------------------------------------------------------------------------+--------------------------------------+
| text| result|
+----------------------------------------------------------------------------------------------------+--------------------------------------+
|Sarah, a 55-year-old woman with a history of high cholesterol and a family history of heart disea...| [Under_Treatment]|
|John, a 60-year-old man with a history of smoking and high blood pressure, presented to his prima...|[Not_Under_Treatment_Or_Not_Mentioned]|
+----------------------------------------------------------------------------------------------------+--------------------------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|genericclassifier_sdoh_under_treatment_sbiobert_cased_mli|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[features]|
|Output Labels:|[prediction]|
|Language:|en|
|Size:|3.4 MB|
|Dependencies:|sbiobert_base_cased_mli|
## References
Internal SDOH Project
## Benchmarking
```bash
label precision recall f1-score support
Not_Under_Treatment_Or_Not_Mentioned 0.86 0.68 0.76 222
Under_Treatment 0.86 0.94 0.90 450
accuracy - - 0.86 672
macro-avg 0.86 0.81 0.83 672
weighted-avg 0.86 0.86 0.85 672
```
---
layout: model
title: English BertForQuestionAnswering model (from madlag)
author: John Snow Labs
name: bert_qa_bert_large_uncased_wwm_squadv2_x2.63_f82.6_d16_hybrid_v1
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1` is an English model originally trained by `madlag`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_wwm_squadv2_x2.63_f82.6_d16_hybrid_v1_en_4.0.0_3.0_1654183687318.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_wwm_squadv2_x2.63_f82.6_d16_hybrid_v1_en_4.0.0_3.0_1654183687318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_wwm_squadv2_x2.63_f82.6_d16_hybrid_v1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_large_uncased_wwm_squadv2_x2.63_f82.6_d16_hybrid_v1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.bert.large_uncased_v2_x2.63_f82.6_d16_hybrid.by_madlag").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
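The `|||` in the string above is nlu's separator for packing a question and its context into a single input string. A minimal sketch of building such an input (the helper name `qa_input` is ours, not part of nlu):

```python
def qa_input(question: str, context: str) -> str:
    """Join a question and its context with the '|||' separator that
    nlu's question-answering loaders expect in a single string."""
    return f"{question}|||{context}"

print(qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
# What's my name?|||My name is Clara and I live in Berkeley.
```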
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_large_uncased_wwm_squadv2_x2.63_f82.6_d16_hybrid_v1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|349.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1
- https://rajpurkar.github.io/SQuAD-explorer
- https://www.aclweb.org/anthology/N19-1423.pdf
---
layout: model
title: German XlmRoBertaForQuestionAnswering (from bhavikardeshna)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_base_german
date: 2022-06-23
tags: [de, open_source, question_answering, xlmroberta]
task: Question Answering
language: de
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-german` is a German model originally trained by `bhavikardeshna`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_german_de_4.0.0_3.0_1655989699247.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_german_de_4.0.0_3.0_1655989699247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_german","de") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = XlmRoBertaForQuestionAnswering
.pretrained("xlm_roberta_qa_xlm_roberta_base_german","de")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("de.answer_question.xlm_roberta.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_base_german|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|de|
|Size:|883.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/bhavikardeshna/xlm-roberta-base-german
---
layout: model
title: English BertForSequenceClassification Tiny Cased model (from mrm8488)
author: John Snow Labs
name: bert_sequence_classifier_tiny_finetuned_fake_news_detection
date: 2022-07-13
tags: [en, open_source, bert, sequence_classification]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-finetuned-fake-news-detection` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_tiny_finetuned_fake_news_detection_en_4.0.0_3.0_1657720809070.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_tiny_finetuned_fake_news_detection_en_4.0.0_3.0_1657720809070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_tiny_finetuned_fake_news_detection","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_tiny_finetuned_fake_news_detection","en")
.setInputCols(Array("document", "token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_tiny_finetuned_fake_news_detection|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|16.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mrm8488/bert-tiny-finetuned-fake-news-detection
---
layout: model
title: Translate South Slavic languages to English Pipeline
author: John Snow Labs
name: translate_zls_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, zls, en, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with contributions from academia (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and from commercial partners.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this module is computationally expensive, especially on longer sequences, so the use of an accelerator such as a GPU is recommended.
- source languages: `zls`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_zls_en_xx_2.7.0_2.4_1609688390022.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_zls_en_xx_2.7.0_2.4_1609688390022.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_zls_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_zls_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.zls.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_zls_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Detect Assertion Status from Smoking Status Entity
author: John Snow Labs
name: assertion_oncology_smoking_status_wip
date: 2022-10-01
tags: [licensed, clinical, oncology, en, assertion]
task: Assertion Status
language: en
nav_key: models
edition: Healthcare NLP 4.1.0
spark_version: 3.0
supported: true
annotator: AssertionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model detects the assertion status of the Smoking_Status entity. It classifies extractions as Present, Past or Absent.
## Predicted Entities
`Absent`, `Past`, `Present`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_smoking_status_wip_en_4.1.0_3.0_1664641973214.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_smoking_status_wip_en_4.1.0_3.0_1664641973214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(["Smoking_Status"])
assertion = AssertionDLModel.pretrained("assertion_oncology_smoking_status_wip", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
assertion])
data = spark.createDataFrame([["The patient quit smoking three years ago."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("Smoking_Status"))
val clinical_assertion = AssertionDLModel.pretrained("assertion_oncology_smoking_status_wip","en","clinical/models")
.setInputCols(Array("sentence","ner_chunk","embeddings"))
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
clinical_assertion))
val data = Seq("""The patient quit smoking three years ago.""").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.assert.oncology_smoking_status").predict("""The patient quit smoking three years ago.""")
```
## Results
```bash
| chunk | ner_label | assertion |
|:--------|:---------------|:------------|
| smoking | Smoking_Status | Past |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|assertion_oncology_smoking_status_wip|
|Compatibility:|Healthcare NLP 4.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, chunk, embeddings]|
|Output Labels:|[assertion_pred]|
|Language:|en|
|Size:|1.4 MB|
## References
In-house annotated oncology case reports.
## Benchmarking
```bash
label precision recall f1-score support
Absent 0.75 1.00 0.86 12.0
Past 0.78 0.93 0.85 15.0
Present 1.00 0.46 0.63 13.0
macro-avg 0.84 0.80 0.78 40.0
weighted-avg 0.84 0.80 0.78 40.0
```
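The macro and weighted averages in the table above follow directly from the per-class rows; a quick sanity check in plain Python (macro averages the three class scores equally, the weighted average weights them by support):

```python
# Recompute the summary rows of the benchmarking table from the per-class scores.
rows = {               # label: (precision, recall, f1, support)
    "Absent":  (0.75, 1.00, 0.86, 12),
    "Past":    (0.78, 0.93, 0.85, 15),
    "Present": (1.00, 0.46, 0.63, 13),
}
n = sum(r[3] for r in rows.values())  # total support: 40
macro = [round(sum(r[i] for r in rows.values()) / len(rows), 2) for i in range(3)]
weighted = [round(sum(r[i] * r[3] for r in rows.values()) / n, 2) for i in range(3)]
print(macro)     # [0.84, 0.8, 0.78]
print(weighted)  # [0.84, 0.8, 0.78]
```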
---
layout: model
title: English RobertaForQuestionAnswering (from huxxx657)
author: John Snow Labs
name: roberta_qa_roberta_base_finetuned_scrambled_squad_5
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-5` is an English model originally trained by `huxxx657`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_scrambled_squad_5_en_4.0.0_3.0_1655734166418.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_scrambled_squad_5_en_4.0.0_3.0_1655734166418.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_scrambled_squad_5","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_base_finetuned_scrambled_squad_5","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
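Under the hood, extractive QA models of this kind score every candidate span and return the one maximizing the sum of its start and end logits. A minimal sketch with hypothetical logits (the token strings and scores are illustrative, not from this model):

```python
# Extractive QA span selection: pick (start, end) maximizing
# start_logit + end_logit, subject to start <= end and a length cap.
def best_span(start_logits, end_logits, max_len=15):
    best = (0, 0, float("-inf"))
    for s, sl in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = sl + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best[:2]

tokens = ["My", "name", "is", "Clara"]
start = [0.1, 0.2, 0.3, 2.0]   # hypothetical per-token start logits
end   = [0.0, 0.1, 0.2, 2.5]   # hypothetical per-token end logits
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```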
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_scrambled_5.by_huxxx657").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_finetuned_scrambled_squad_5|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|463.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-5
---
layout: model
title: Legal Legends Clause Binary Classifier
author: John Snow Labs
name: legclf_legends_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `legends` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `legends`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_legends_clause_en_1.0.0_3.2_1660123668543.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_legends_clause_en_1.0.0_3.2_1660123668543.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
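This card does not include a usage snippet. A minimal Spark NLP sketch in the style of the other classifier cards in this collection — note that both the upstream sentence-embeddings model (`sent_bert_base_cased`) and the `ClassifierDLModel` annotator are assumptions here, since the card does not state which embeddings the classifier was trained with:

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Assumption: a generic sentence-embeddings model; the embeddings actually
# used in training are not stated on this card.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

classifier = ClassifierDLModel.pretrained("legclf_legends_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = Pipeline(stages=[document_assembler, embeddings, classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```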
## Results
```bash
+---------+
|   result|
+---------+
|[legends]|
|  [other]|
|  [other]|
|[legends]|
+---------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_legends_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|
## References
Legal documents, scraped from the Internet and classified in-house.
## Benchmarking
```bash
label precision recall f1-score support
legends 0.98 0.98 0.98 57
other 0.99 0.99 0.99 128
accuracy - - 0.99 185
macro-avg 0.99 0.99 0.99 185
weighted-avg 0.99 0.99 0.99 185
```
---
layout: model
title: German asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377 TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377
date: 2022-09-26
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377` is a German model originally trained by jonatasgrosman.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377_de_4.2.0_3.0_1664189496988.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377_de_4.2.0_3.0_1664189496988.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377", "de")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377", "de")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
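The `text` column produced above comes from CTC decoding of the model's frame-level predictions: consecutive repeated tokens are merged and blank tokens are dropped. A minimal greedy sketch with a toy alphabet (the vocabulary and frame ids are illustrative, not this model's):

```python
# Greedy CTC collapse: merge consecutive repeats, then drop the blank token.
def ctc_greedy_decode(frame_ids, blank=0):
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

vocab = {1: "h", 2: "i"}          # toy vocabulary
frames = [1, 1, 0, 2, 2, 2]       # per-frame argmax ids, 0 = blank
print("".join(vocab[i] for i in ctc_greedy_decode(frames)))  # hi
```

Note that a repeated character separated by a blank survives the collapse, which is exactly why CTC uses a blank symbol.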
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|de|
|Size:|1.2 GB|
---
layout: model
title: Fast Neural Machine Translation Model from Celtic Languages to English
author: John Snow Labs
name: opus_mt_cel_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, cel, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team; many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `cel`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_cel_en_xx_2.7.0_2.4_1609168916224.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_cel_en_xx_2.7.0_2.4_1609168916224.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_cel_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["Your text to translate."]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_cel_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your text to translate.").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.cel.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_cel_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: German asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412 TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412` is a German model originally trained by jonatasgrosman.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412_de_4.2.0_3.0_1664112235753.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412_de_4.2.0_3.0_1664112235753.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412", "de")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412", "de")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|de|
|Size:|1.2 GB|
---
layout: model
title: Sentence Embeddings - sbert medium (tuned)
author: John Snow Labs
name: sbert_jsl_medium_uncased
date: 2021-05-14
tags: [embeddings, clinical, licensed, en]
task: Embeddings
language: en
nav_key: models
edition: Healthcare NLP 3.0.3
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is trained to generate contextual sentence embeddings of input sentences.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_uncased_en_3.0.3_2.4_1621017111185.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_uncased_en_3.0.3_2.4_1621017111185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sbiobert_embeddings = BertSentenceEmbeddings\
.pretrained("sbert_jsl_medium_uncased","en","clinical/models")\
.setInputCols(["sentence"])\
.setOutputCol("sbert_embeddings")
```
```scala
val sbiobert_embeddings = BertSentenceEmbeddings
.pretrained("sbert_jsl_medium_uncased","en","clinical/models")
.setInputCols(Array("sentence"))
.setOutputCol("sbert_embeddings")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.embed_sentence.bert.jsl_medium_uncased").predict("""Put your text here.""")
```
## Results
```bash
Gives a 768-dimensional vector representation of the sentence.
```
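Sentence embeddings like these are typically compared with cosine similarity (as in the STS(cos) benchmark below). A minimal sketch with toy 3-dimensional vectors standing in for the 768-dimensional embeddings:

```python
import math

# Cosine similarity between two embedding vectors.
def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

a = [1.0, 0.0, 1.0]  # toy stand-ins for sentence embeddings
b = [1.0, 1.0, 0.0]
print(round(cosine(a, b), 3))  # 0.5
```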
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbert_jsl_medium_uncased|
|Compatibility:|Healthcare NLP 3.0.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Case sensitive:|false|
## Data Source
Tuned on MedNLI dataset
## Benchmarking
```bash
MedNLI Score
Acc 0.724
STS(cos) 0.743
```
---
layout: model
title: Translate English to Tongan Pipeline
author: John Snow Labs
name: translate_en_to
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, to, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team; many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `to`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_to_xx_2.7.0_2.4_1609686800591.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_to_xx_2.7.0_2.4_1609686800591.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_to", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_to", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.to').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_to|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English RobertaForQuestionAnswering (from rahulchakwate)
author: John Snow Labs
name: roberta_qa_roberta_large_finetuned_squad
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-finetuned-squad` is an English model originally trained by `rahulchakwate`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_finetuned_squad_en_4.0.0_3.0_1655736911099.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_finetuned_squad_en_4.0.0_3.0_1655736911099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_roberta_large_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.large.by_rahulchakwate").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_large_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/rahulchakwate/roberta-large-finetuned-squad
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_6
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_6_en_4.3.0_3.0_1674213584769.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_6_en_4.3.0_3.0_1674213584769.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_6","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_6","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_6|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|439.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-6
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_128_finetuned_squad_seed_8
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-128-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_128_finetuned_squad_seed_8_en_4.3.0_3.0_1674214068825.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_128_finetuned_squad_seed_8_en_4.3.0_3.0_1674214068825.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_128_finetuned_squad_seed_8","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_128_finetuned_squad_seed_8","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_128_finetuned_squad_seed_8|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|423.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-128-finetuned-squad-seed-8
---
layout: model
title: English image_classifier_vit_dog ViTForImageClassification from Sena
author: John Snow Labs
name: image_classifier_vit_dog
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_dog` is an English model originally trained by Sena.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dog_en_4.1.0_3.0_1660169568437.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dog_en_4.1.0_3.0_1660169568437.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_dog", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_dog", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_dog|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Pipeline to Detect Adverse Drug Events (MedicalBertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_ade_tweet_binary_pipeline
date: 2023-03-20
tags: [clinical, licensed, ade, en, medicalbertfortokenclassification, ner]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [bert_token_classifier_ade_tweet_binary](https://nlp.johnsnowlabs.com/2022/07/29/bert_token_classifier_ade_tweet_binary_en_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ade_tweet_binary_pipeline_en_4.3.0_3.2_1679298990358.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ade_tweet_binary_pipeline_en_4.3.0_3.2_1679298990358.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_ade_tweet_binary_pipeline", "en", "clinical/models")
text = '''I used to be on paxil but that made me more depressed and prozac made me angry. Maybe cos of the insulin blocking effect of seroquel but i do feel sugar crashes when eat fast carbs.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_ade_tweet_binary_pipeline", "en", "clinical/models")
val text = "I used to be on paxil but that made me more depressed and prozac made me angry. Maybe cos of the insulin blocking effect of seroquel but i do feel sugar crashes when eat fast carbs."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:-----------------|--------:|------:|:------------|-------------:|
| 0 | depressed | 44 | 52 | ADE | 0.999755 |
| 1 | angry | 73 | 77 | ADE | 0.999608 |
| 2 | insulin blocking | 97 | 112 | ADE | 0.738712 |
| 3 | sugar crashes | 147 | 159 | ADE | 0.993742 |
```
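The `begin` and `end` columns are inclusive character offsets into the input text, so a Python slice needs `end + 1`; a quick sanity check against the example sentence:

```python
# The sample sentence from the example above.
text = ("I used to be on paxil but that made me more depressed and prozac "
        "made me angry. Maybe cos of the insulin blocking effect of seroquel "
        "but i do feel sugar crashes when eat fast carbs.")

# begin/end in the results table are inclusive, so slice up to end + 1.
assert text[44:52 + 1] == "depressed"
assert text[73:77 + 1] == "angry"
assert text[147:159 + 1] == "sugar crashes"
```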
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ade_tweet_binary_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.7 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverterInternalModel
---
layout: model
title: Multilingual XLMRoBerta Embeddings (from castorini)
author: John Snow Labs
name: xlmroberta_embeddings_afriberta_large
date: 2022-05-13
tags: [ha, yo, ig, am, so, open_source, xlm_roberta, embeddings, xx, afriberta]
task: Embeddings
language: xx
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: XlmRoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `afriberta_large` is a Multilingual model originally trained by `castorini`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_afriberta_large_xx_3.4.4_3.0_1652439242600.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_afriberta_large_xx_3.4.4_3.0_1652439242600.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_afriberta_large","xx") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_afriberta_large","xx")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_embeddings_afriberta_large|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|xx|
|Size:|471.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/castorini/afriberta_large
- https://github.com/keleog/afriberta
---
layout: model
title: Detect Chemical Compounds and Genes
author: John Snow Labs
name: ner_chemprot_clinical
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a pre-trained model that can be used to automatically detect all chemical compounds and gene mentions from medical texts.
## Predicted Entities
`CHEMICAL`, `GENE-Y`, `GENE-N`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CHEMPROT_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_CHEMPROT_CLINICAL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_clinical_en_3.0.0_3.0_1617208430062.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_clinical_en_3.0.0_3.0_1617208430062.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_chemprot_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium."]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_chemprot_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))
val data = Seq("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.chemprot.clinical").predict("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""")
```
## Results
```bash
+----+---------------------------------+---------+-------+----------+
| | chunk | begin | end | entity |
+====+=================================+=========+=======+==========+
| 0 | Keratinocyte growth factor | 0 | 25 | GENE-Y |
+----+---------------------------------+---------+-------+----------+
| 1 | acidic fibroblast growth factor | 31 | 61 | GENE-Y |
+----+---------------------------------+---------+-------+----------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_chemprot_clinical|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
## Data Source
This model was trained on the ChemProt corpus using `embeddings_clinical` embeddings. Make sure you use the same embeddings when running the model.
## Benchmarking
```bash
| | label | tp | fp | fn | prec | rec | f1 |
|---:|:--------------|-------:|------:|-----:|---------:|---------:|---------:|
| 0 | B-GENE-Y | 4650 | 1090 | 838 | 0.810105 | 0.847303 | 0.828286 |
| 1 | B-GENE-N | 1732 | 981 | 1019 | 0.638408 | 0.629589 | 0.633968 |
| 2 | I-GENE-Y | 1846 | 571 | 573 | 0.763757 | 0.763125 | 0.763441 |
| 3 | B-CHEMICAL | 7512 | 804 | 1136 | 0.903319 | 0.86864 | 0.88564 |
| 4 | I-CHEMICAL | 1059 | 169 | 253 | 0.862378 | 0.807165 | 0.833858 |
| 5 | I-GENE-N | 1393 | 853 | 598 | 0.620214 | 0.699648 | 0.657541 |
| 6 | Macro-average | 18192 | 4468 | 4417 | 0.766363 | 0.769245 | 0.767801 |
| 7 | Micro-average | 18192 | 4468 | 4417 | 0.802824 | 0.804635 | 0.803729 |
```
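The micro-averaged row pools the tp/fp/fn counts across all labels before computing the metrics; the reported figures can be reproduced directly:

```python
# Pooled counts from the micro-average row of the benchmarking table.
tp, fp, fn = 18192, 4468, 4417

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Matches the table: prec ~ 0.802824, rec ~ 0.804635, f1 ~ 0.803729
```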
---
layout: model
title: Arabic BertForMaskedLM Large Cased model (from aubmindlab)
author: John Snow Labs
name: bert_embeddings_large_arabertv02
date: 2022-12-02
tags: [ar, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: ar
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-arabertv02` is an Arabic model originally trained by `aubmindlab`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_arabertv02_ar_4.2.4_3.0_1670019689670.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_arabertv02_ar_4.2.4_3.0_1670019689670.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_arabertv02","ar") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_arabertv02","ar")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_large_arabertv02|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|1.4 GB|
|Case sensitive:|true|
## References
- https://huggingface.co/aubmindlab/bert-large-arabertv02
- https://github.com/google-research/bert
- https://arxiv.org/abs/2003.00104
- https://github.com/WissamAntoun/pydata_khobar_meetup
- http://alt.qcri.org/farasa/segmenter.html
- https://github.com/google-research/bert/blob/master/multilingual.md
- https://github.com/elnagara/HARD-Arabic-Dataset
- https://www.aclweb.org/anthology/D15-1299
- https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf
- https://github.com/mohamedadaly/LABR
- http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp
- https://github.com/husseinmozannar/SOQAL
- https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md
- https://arxiv.org/abs/2003.00104v2
- https://archive.org/details/arwiki-20190201
- https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4
- https://www.aclweb.org/anthology/W19-4619
- https://sites.aub.edu.lb/mindlab/
- https://www.yakshof.com/#/
- https://www.behance.net/rahalhabib
- https://www.linkedin.com/in/wissam-antoun-622142b4/
- https://twitter.com/wissam_antoun
- https://github.com/WissamAntoun
- https://www.linkedin.com/in/fadybaly/
- https://twitter.com/fadybaly
- https://github.com/fadybaly
---
layout: model
title: Map Companies to their Acquisitions (wikipedia, en)
author: John Snow Labs
name: finmapper_wikipedia_parentcompanies
date: 2023-01-13
tags: [parent, companies, subsidiaries, en, licensed]
task: Chunk Mapping
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model allows you, given an extracted ORG entity, to retrieve its parent companies, subsidiaries, acquisitions, and/or companies in the same group.
IMPORTANT: This requires an exact match with the name as it appears in Wikidata. If you are not sure the name matches, please run `finel_wikipedia_parentcompanies` to normalize the company name first.
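Chunk mapping is, in effect, a key lookup against the Wikidata spellings, which is why normalization matters; a minimal illustration with a hypothetical one-entry table:

```python
# Hypothetical mapping table keyed by the exact Wikidata company name.
mappings = {"Barclays": ["http://www.wikidata.org/entity/Q245343"]}

assert "Barclays" in mappings          # exact match: the mapper returns relations
assert "Barclays PLC" not in mappings  # spelling variant: nothing is returned,
                                       # hence the normalization step first
```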
## Predicted Entities
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmapper_wikipedia_parentcompanies_en_1.0.0_3.0_1673610612510.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmapper_wikipedia_parentcompanies_en_1.0.0_3.0_1673610612510.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained('finner_orgs_prods_alias', 'en', 'finance/models')\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
# Optional: To normalize the ORG name using Wikipedia data before the mapping
##########################################################################
chunkToDoc = nlp.Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
chunk_embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
.setInputCols("ner_chunk_doc") \
.setOutputCol("sentence_embeddings")
use_er_model = finance.SentenceEntityResolverModel.pretrained("finel_wikipedia_parentcompanies", "en", "finance/models") \
.setInputCols(["ner_chunk_doc", "sentence_embeddings"]) \
.setOutputCol("normalized")\
.setDistanceFunction("EUCLIDEAN")
##########################################################################
cm = finance.ChunkMapperModel()\
.pretrained("finmapper_wikipedia_parentcompanies", "en", "finance/models")\
.setInputCols(["normalized"])\
.setOutputCol("mappings") # set the input to "ner_chunk" instead of "normalized" for the non-normalized version
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter,
chunkToDoc,
chunk_embeddings,
use_er_model,
cm
])
text = ["""Barclays is a British multinational bank which operates worldwide."""]
test_data = spark.createDataFrame([text]).toDF("text")
model = nlpPipeline.fit(test_data)
lp = nlp.LightPipeline(model)
lp.annotate(text)
```
## Results
```bash
{'mappings': ['http://www.wikidata.org/entity/Q245343',
'Barclays@en-ca',
'http://www.wikidata.org/prop/direct/P355',
'is_parent_of',
'London Stock Exchange@en',
'BARC',
'בנק ברקליס@he',
'http://www.wikidata.org/entity/Q29488227'],
...
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finmapper_wikipedia_parentcompanies|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|852.6 KB|
## References
Wikidata
---
layout: model
title: Modern Greek (1453-) asr_wav2vec2_large_xlsr_greek_1 TFWav2Vec2ForCTC from skylord
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_greek_1
date: 2022-09-25
tags: [wav2vec2, el, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: el
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_greek_1` is a Modern Greek (1453-) model originally trained by skylord.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_greek_1_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_greek_1_el_4.2.0_3.0_1664110254247.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_greek_1_el_4.2.0_3.0_1664110254247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_greek_1', lang = 'el')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_greek_1", lang = "el")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_greek_1|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|el|
|Size:|1.2 GB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: NER Pipeline - Voice of the Patient
author: John Snow Labs
name: ner_vop_pipeline
date: 2023-06-09
tags: [pipeline, ner, en, licensed, vop]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline includes the full taxonomy Named-Entity Recognition model to extract information from health-related text in colloquial language. This pipeline extracts diagnoses, treatments, tests, anatomical references and demographic entities.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_pipeline_en_4.4.3_3.0_1686338017684.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_pipeline_en_4.4.3_3.0_1686338017684.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_vop_pipeline", "en", "clinical/models")
pipeline.annotate("
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_vop_pipeline", "en", "clinical/models")
val result = pipeline.annotate("
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_deit_base_patch16_224", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_deit_base_patch16_224", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_deit_base_patch16_224|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|324.7 MB|
---
layout: model
title: Financial 10K Filings NER
author: John Snow Labs
name: finner_10k_summary
date: 2022-08-17
tags: [en, finance, ner, annual, reports, 10k, filings, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
IMPORTANT: Don't run this model on a whole financial report. Instead:
- Split the report by paragraphs;
- Use the `finclf_form_10k_summary_item` Text Classifier to select only the relevant paragraphs.
This Financial NER model is aimed at processing the first summary page of 10-K filings and extracting information about the company submitting the filing, trading data, address / phones, CFN, IRS, etc.
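The paragraph-splitting step is plain pre-processing that can happen before the Spark pipeline; a minimal sketch (the subsequent filtering would use the `finclf_form_10k_summary_item` classifier mentioned above):

```python
import re

def split_paragraphs(report: str) -> list:
    """Split a filing on blank lines, dropping empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", report) if p.strip()]

filing = ("ANNUAL REPORT PURSUANT TO SECTION 13\n\n"
          "Commission File Number: 001-38856\n\n"
          "PAGERDUTY, INC.")
paragraphs = split_paragraphs(filing)
# Each paragraph would then be classified and, if relevant, passed to the NER model.
```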
## Predicted Entities
`ADDRESS`, `CFN`, `FISCAL_YEAR`, `IRS`, `ORG`, `PHONE`, `STATE`, `STOCK_EXCHANGE`, `TICKER`, `TITLE_CLASS`, `TITLE_CLASS_VALUE`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_SEC10K_FIRSTPAGE/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_10k_summary_en_1.0.0_3.2_1660732829888.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_10k_summary_en_1.0.0_3.2_1660732829888.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence") \
.setCustomBounds(["\n\n"])
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_finbert_pretrain_yiyanghkust","en")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")\
.setCaseSensitive(True)\
.setMaxSentenceLength(512)
ner_model = finance.NerModel.pretrained("finner_10k_summary","en","finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter
])
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
data = spark.createDataFrame([["""ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES AND EXCHANGE ACT OF 1934
For the annual period ended January 31, 2021
or
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from________to_______
Commission File Number: 001-38856
PAGERDUTY, INC.
(Exact name of registrant as specified in its charter)
Delaware
27-2793871
(State or other jurisdiction of
incorporation or organization)
(I.R.S. Employer
Identification Number)
600 Townsend St., Suite 200, San Francisco, CA 94103
(844) 800-3889
(Address, including zip code, and telephone number, including area code, of registrant’s principal executive offices)
Securities registered pursuant to Section 12(b) of the Act:
Title of each class
Trading symbol(s)
Name of each exchange on which registered
Common Stock, $0.000005 par value,
PD
New York Stock Exchange"""]]).toDF("text")
result = model.transform(data)
import pyspark.sql.functions as F
result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
.select(F.expr("cols['0']").alias("ticker"),
F.expr("cols['1']['entity']").alias("label")).show(50, truncate = False)
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_tiny_base_cased_distilled_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_tiny_base_cased_distilled_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_tiny_cased").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_tiny_base_cased_distilled_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|641.4 KB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/sshleifer/tiny-distilbert-base-cased-distilled-squad
---
layout: model
title: Stop Words Cleaner for Latin
author: John Snow Labs
name: stopwords_la
date: 2020-07-14 19:03:00 +0800
task: Stop Words Removal
language: la
edition: Spark NLP 2.5.4
spark_version: 2.4
tags: [stopwords, la]
supported: true
annotator: StopWordsCleaner
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_la_la_2.5.4_2.4_1594742439769.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_la_la_2.5.4_2.4_1594742439769.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
stop_words = StopWordsCleaner.pretrained("stopwords_la", "la") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Alius est esse regem Aquilonis, et de Anglis medicus et nives Ioannes dux in progressus medicinae anesthesia et hygiene.")
```
```scala
...
val stopWords = StopWordsCleaner.pretrained("stopwords_la", "la")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords))
val data = Seq("Alius est esse regem Aquilonis, et de Anglis medicus et nives Ioannes dux in progressus medicinae anesthesia et hygiene.").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["""Alius est esse regem Aquilonis, et de Anglis medicus et nives Ioannes dux in progressus medicinae anesthesia et hygiene."""]
stopword_df = nlu.load('la.stopwords').predict(text)
stopword_df[['cleanTokens']]
```
{:.h2_title}
## Results
```bash
[Row(annotatorType='token', begin=0, end=4, result='Alius', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=10, end=13, result='esse', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=15, end=19, result='regem', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=21, end=29, result='Aquilonis', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=30, end=30, result=',', metadata={'sentence': '0'}),
...]
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|stopwords_la|
|Type:|stopwords|
|Compatibility:|Spark NLP 2.5.4+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|la|
|Case sensitive:|false|
|License:|Open Source|
{:.h2_title}
## Data Source
The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)
---
layout: model
title: English BertForQuestionAnswering model (from vuiseng9)
author: John Snow Labs
name: bert_qa_bert_l_squadv1.1_sl256
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-l-squadv1.1-sl256` is an English model originally trained by `vuiseng9`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_l_squadv1.1_sl256_en_4.0.0_3.0_1654536057484.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_l_squadv1.1_sl256_en_4.0.0_3.0_1654536057484.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_l_squadv1.1_sl256","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bert_l_squadv1.1_sl256","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.sl256.by_vuiseng9").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_l_squadv1.1_sl256|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/vuiseng9/bert-l-squadv1.1-sl256
---
layout: model
title: Bangla RoBERTa Embeddings (from neuralspace-reverie)
author: John Snow Labs
name: roberta_embeddings_indic_transformers_bn_roberta
date: 2022-04-14
tags: [roberta, embeddings, bn, open_source]
task: Embeddings
language: bn
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indic-transformers-bn-roberta` is a Bangla model originally trained by `neuralspace-reverie`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_bn_roberta_bn_3.4.2_3.0_1649947557406.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_bn_roberta_bn_3.4.2_3.0_1649947557406.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers_bn_roberta","bn") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["আমি স্পার্ক এনএলপি ভালোবাসি"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers_bn_roberta","bn")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("আমি স্পার্ক এনএলপি ভালোবাসি").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_indic_transformers_bn_roberta|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|bn|
|Size:|312.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/neuralspace-reverie/indic-transformers-bn-roberta
- https://oscar-corpus.com/
---
layout: model
title: Pipeline to Detect Genes and Human Phenotypes
author: John Snow Labs
name: ner_human_phenotype_gene_biobert_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, gene, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_human_phenotype_gene_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_human_phenotype_gene_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_biobert_pipeline_en_3.4.1_3.0_1647867336282.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_biobert_pipeline_en_3.4.1_3.0_1647867336282.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_human_phenotype_gene_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).")
```
```scala
val pipeline = new PretrainedPipeline("ner_human_phenotype_gene_biobert_pipeline", "en", "clinical/models")
pipeline.annotate("Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.human_phenotype_gene_biobert.pipeline").predict("""Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).""")
```
## Results
```bash
+----------------+--------+
|chunks |entities|
+----------------+--------+
|type |GENE |
|polyhydramnios |HP |
|polyuria |HP |
|nephrocalcinosis|HP |
|hypokalemia |HP |
+----------------+--------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_human_phenotype_gene_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.1 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverter
---
layout: model
title: English BertForQuestionAnswering model (from xraychen)
author: John Snow Labs
name: bert_qa_mqa_unsupsim
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mqa-unsupsim` is an English model originally trained by `xraychen`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mqa_unsupsim_en_4.0.0_3.0_1654188398566.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mqa_unsupsim_en_4.0.0_3.0_1654188398566.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mqa_unsupsim","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_mqa_unsupsim","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bert.unsupsim.by_xraychen").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_mqa_unsupsim|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/xraychen/mqa-unsupsim
---
layout: model
title: Pipeline to Detect Normalized Genes and Human Phenotypes (biobert)
author: John Snow Labs
name: ner_human_phenotype_go_biobert_pipeline
date: 2023-03-20
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_human_phenotype_go_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_human_phenotype_go_biobert_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_go_biobert_pipeline_en_4.3.0_3.2_1679315805636.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_go_biobert_pipeline_en_4.3.0_3.2_1679315805636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_human_phenotype_go_biobert_pipeline", "en", "clinical/models")
text = '''Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_human_phenotype_go_biobert_pipeline", "en", "clinical/models")
val text = "Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.phenotype_go_biobert.pipeline").predict("""Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.""")
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:-------------------------|--------:|------:|:------------|-------------:|
| 0 | tumor | 39 | 43 | HP | 1 |
| 1 | tricarboxylic acid cycle | 79 | 102 | GO | 0.999867 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_human_phenotype_go_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.1 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: English RobertaForSequenceClassification Cased model (from joey234)
author: John Snow Labs
name: roberta_classifier_cuenb_mnli
date: 2022-12-09
tags: [en, open_source, roberta, sequence_classification, classification, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cuenb-mnli` is an English model originally trained by `joey234`.
## Predicted Entities
`entailment`, `contradiction`, `neutral`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_cuenb_mnli_en_4.2.4_3.0_1670624962389.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_cuenb_mnli_en_4.2.4_3.0_1670624962389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_cuenb_mnli","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier])
data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_cuenb_mnli","en")
.setInputCols(Array("document", "token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier))
val data = Seq("I love you!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_classifier_cuenb_mnli|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|468.7 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/joey234/cuenb-mnli
- https://paperswithcode.com/sota?task=Text+Classification&dataset=GLUE+MNLI
---
layout: model
title: Fast Neural Machine Translation Model from Austro-Asiatic languages to English
author: John Snow Labs
name: opus_mt_aav_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, aav, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `aav`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_aav_en_xx_2.7.0_2.4_1609169255439.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_aav_en_xx_2.7.0_2.4_1609169255439.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_aav_en", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate here.")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_aav_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate here.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.aav.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_aav_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Sentence Entity Resolver for Billable ICD10-CM HCC Codes (sbiobertresolve_icd10cm_slim_billable_hcc)
author: John Snow Labs
name: sbiobertresolve_icd10cm_slim_billable_hcc
date: 2021-05-25
tags: [icd10cm, slim, licensed, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.3
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model maps clinical entities and concepts to ICD10 CM codes using sentence biobert embeddings. In this model, synonyms having low cosine similarity to unnormalized terms are dropped. It also returns the official resolution text within the brackets inside the metadata. The model is augmented with synonyms, and previous augmentations are flexed according to cosine distances to unnormalized terms (ground truths).
## Predicted Entities
Outputs 7-digit billable ICD codes. In the result, look for the aux_label parameter in the metadata to get the HCC status. The HCC status can be split into three components: billable status, HCC status, and HCC score. In the example shown below, the billable status is 1, the HCC status is 1, and the HCC score is 11.
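Splitting the aux_label into those three components can be sketched in plain Python. This is a hedged illustration: the helper name is hypothetical, and it assumes a `||`-separated aux_label layout as used by other John Snow Labs resolvers, so verify the delimiter in your own output:

```python
def parse_hcc_status(aux_label: str) -> dict:
    """Hypothetical helper: split a "billable||hcc||score" aux_label
    string into its billable status, HCC status, and HCC score."""
    billable, hcc, score = aux_label.split("||")
    return {
        "billable": billable == "1",
        "hcc": hcc == "1",
        "hcc_score": int(score),
    }

# For the example discussed above (billable=1, hcc=1, score=11):
print(parse_hcc_status("1||1||11"))
```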
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_slim_billable_hcc_en_3.0.3_2.4_1621942329774.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_slim_billable_hcc_en_3.0.3_2.4_1621942329774.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sbert_embedder = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sbert_embeddings")
icd10_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_icd10cm_slim_billable_hcc","en", "clinical/models") \
.setInputCols(["document", "sbert_embeddings"]) \
.setOutputCol("icd10cm_code")\
.setDistanceFunction("EUCLIDEAN")\
.setReturnCosineDistances(True)
bert_pipeline_icd = Pipeline(stages = [document_assembler, sbert_embedder, icd10_resolver])
data = spark.createDataFrame([["bladder cancer"]]).toDF("text")
results = bert_pipeline_icd.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sbert_embeddings")
val icd10_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_icd10cm_slim_billable_hcc","en", "clinical/models")
.setInputCols(Array("document", "sbert_embeddings"))
.setOutputCol("icd10cm_code")
.setDistanceFunction("EUCLIDEAN").setReturnCosineDistances(true)
val bert_pipeline_icd = new Pipeline().setStages(Array(document_assembler, sbert_embedder, icd10_resolver))
val data = Seq("bladder cancer").toDF("text")
val result = bert_pipeline_icd.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.icd10cm.slim_billable_hcc").predict("""bladder cancer""")
```
## Results
```bash
| | chunks | code | resolutions | all_codes | billable_hcc_status_score | all_distances |
|---:|:---------------|:--------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|-------------------------------------------:|:----------------------------|:---------------------------------------------------------|
| 0 | bladder cancer | C671 |[bladder cancer, dome [Malignant neoplasm of dome of bladder], cancer of the urinary bladder [Malignant neoplasm of bladder, unspecified], adenocarcinoma, bladder neck [Malignant neoplasm of bladder neck], cancer in situ of urinary bladder [Carcinoma in situ of bladder], cancer of the urinary bladder, ureteric orifice [Malignant neoplasm of ureteric orifice], tumor of bladder neck [Neoplasm of unspecified behavior of bladder], cancer of the urethra [Malignant neoplasm of urethra]]| [C671, C679, C675, D090, C676, D494, C680] | ['1', '1', '11'] | [0.0685, 0.0709, 0.0963, 0.0978, 0.1068, 0.1080, 0.1211] |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_icd10cm_slim_billable_hcc|
|Compatibility:|Healthcare NLP 3.0.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[icd10cm_code]|
|Language:|en|
|Case sensitive:|false|
---
layout: model
title: Legal No Waivers Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_no_waivers_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, no_waivers, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `No_Waivers` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that this model's embeddings allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
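Paragraph splitting by multiline can be done in plain Python before feeding chunks to the classifier. A minimal sketch (this is an illustration, not the workshop notebook's exact implementation):

```python
import re

def split_paragraphs(text: str):
    """Split a document on blank lines and drop empty chunks,
    so each paragraph can be classified separately."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause 1. No failure to exercise any right shall operate as a waiver.\n\nClause 2. This Agreement is governed by the laws of Delaware.\n\n"
for paragraph in split_paragraphs(doc):
    print(paragraph)
```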
## Predicted Entities
`No_Waivers`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_waivers_bert_en_1.0.0_3.0_1678050601610.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_waivers_bert_en_1.0.0_3.0_1678050601610.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
## Results
```bash
+------------+
|result      |
+------------+
|[No_Waivers]|
|[Other]     |
|[Other]     |
|[No_Waivers]|
+------------+
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_no_waivers_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
No_Waivers 0.90 0.98 0.94 54
Other 0.99 0.92 0.95 73
accuracy - - 0.94 127
macro-avg 0.94 0.95 0.94 127
weighted-avg 0.95 0.94 0.95 127
```
---
layout: model
title: English DistilBertForQuestionAnswering Small Cased model (from ncduy)
author: John Snow Labs
name: distilbert_qa_base_cased_led_squad_finetuned_small
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-finetuned-squad-small` is an English model originally trained by `ncduy`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_finetuned_small_en_4.3.0_3.0_1672766627900.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_finetuned_small_en_4.3.0_3.0_1672766627900.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_finetuned_small","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_finetuned_small","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_cased_led_squad_finetuned_small|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/ncduy/distilbert-base-cased-distilled-squad-finetuned-squad-small
---
layout: model
title: French CamemBert Embeddings (from mbateman)
author: John Snow Labs
name: camembert_embeddings_mbateman_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `mbateman`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_mbateman_generic_model_fr_3.4.4_3.0_1653989653583.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_mbateman_generic_model_fr_3.4.4_3.0_1653989653583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_mbateman_generic_model","fr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_mbateman_generic_model","fr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("J'adore Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_mbateman_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/mbateman/dummy-model
---
layout: model
title: English BertForQuestionAnswering model (from bioformers)
author: John Snow Labs
name: bert_qa_bioformer_cased_v1.0_squad1
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bioformer-cased-v1.0-squad1` is an English model originally trained by `bioformers`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bioformer_cased_v1.0_squad1_en_4.0.0_3.0_1654185774933.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bioformer_cased_v1.0_squad1_en_4.0.0_3.0_1654185774933.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bioformer_cased_v1.0_squad1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_bioformer_cased_v1.0_squad1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bioformer.cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_bioformer_cased_v1.0_squad1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|158.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/bioformers/bioformer-cased-v1.0-squad1
- https://rajpurkar.github.io/SQuAD-explorer
- https://arxiv.org/pdf/1910.01108.pdf
---
layout: model
title: Abkhazian asr_speech_sprint_test TFWav2Vec2ForCTC from Mofe
author: John Snow Labs
name: pipeline_asr_speech_sprint_test
date: 2022-09-24
tags: [wav2vec2, ab, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: ab
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_speech_sprint_test` is an Abkhazian model originally trained by Mofe.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_speech_sprint_test_gpu
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_speech_sprint_test_ab_4.2.0_3.0_1664021417370.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_speech_sprint_test_ab_4.2.0_3.0_1664021417370.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_speech_sprint_test', lang = 'ab')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_speech_sprint_test", lang = "ab")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_speech_sprint_test|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|ab|
|Size:|452.6 KB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Legal Counterparts Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_counterparts_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, counterparts, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Counterparts` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
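Paragraph splitting by multiline, for instance, needs no special tooling; a minimal sketch in plain Python (independent of the Legal NLP helpers, illustration only):

```python
import re

def split_paragraphs(text):
    # Split on runs of blank lines (one or more empty lines between paragraphs)
    chunks = re.split(r"\n\s*\n", text)
    # Trim whitespace and discard empty chunks
    return [c.strip() for c in chunks if c.strip()]

doc = "Clause 1. No waivers apply.\n\n\nClause 2. Counterparts may be signed.\n"
print(split_paragraphs(doc))
# → ['Clause 1. No waivers apply.', 'Clause 2. Counterparts may be signed.']
```

Each resulting chunk can then be fed to the classifier as a separate row.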
Take into account that this model's embeddings allow up to 512 tokens. If your input is longer, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`Counterparts`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_counterparts_bert_en_1.0.0_3.0_1678050553108.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_counterparts_bert_en_1.0.0_3.0_1678050553108.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
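This card does not ship a usage snippet; below is a minimal sketch assuming the standard Legal NLP document classification pipeline from similar cards in this hub, with the `johnsnowlabs` library's `nlp`/`legal` modules, a running `spark` session, and `sent_bert_base_cased` producing the `sentence_embeddings` this classifier expects (adjust if your setup differs).

```python
# Sketch only: assumes the `johnsnowlabs` nlp/legal modules and a Spark session.
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumption: sent_bert_base_cased supplies the sentence_embeddings input
# listed under Input Labels in the Model Information table.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_counterparts_bert", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("class.result").show(truncate=False)
```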
## Results
```bash
+--------------+
|result        |
+--------------+
|[Counterparts]|
|[Other]       |
|[Other]       |
|[Counterparts]|
+--------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_counterparts_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Counterparts 1.00 0.99 1.0 302
Other 0.99 1.00 1.0 338
accuracy - - 1.0 640
macro-avg 1.00 1.00 1.0 640
weighted-avg 1.00 1.00 1.0 640
```
---
layout: model
title: Portuguese RoBERTa Embeddings (from rdenadai)
author: John Snow Labs
name: roberta_embeddings_BR_BERTo
date: 2022-04-14
tags: [roberta, embeddings, pt, open_source]
task: Embeddings
language: pt
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `BR_BERTo` is a Portuguese model originally trained by `rdenadai`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_BR_BERTo_pt_3.4.2_3.0_1649947632437.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_BR_BERTo_pt_3.4.2_3.0_1649947632437.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_BR_BERTo","pt") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_BR_BERTo","pt")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("Eu amo Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("pt.embed.BR_BERTo").predict("""Eu amo Spark NLP""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_BR_BERTo|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|pt|
|Size:|637.2 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/rdenadai/BR_BERTo
- https://github.com/rdenadai/BR-BERTo
---
layout: model
title: Chinese BertForMaskedLM Base Cased model
author: John Snow Labs
name: bert_embeddings_base_chinese
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-chinese` is a Chinese model originally trained by HuggingFace.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_chinese_zh_4.2.4_3.0_1670016364528.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_chinese_zh_4.2.4_3.0_1670016364528.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_chinese","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_chinese","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_chinese|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|383.8 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/bert-base-chinese
- https://aclanthology.org/2021.acl-long.330.pdf
- https://dl.acm.org/doi/pdf/10.1145/3442188.3445922
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_12_h_128
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-12_H-128` is a Chinese model originally trained by `uer`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_128_zh_4.2.4_3.0_1670021529933.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_128_zh_4.2.4_3.0_1670021529933.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_128","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_128","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_12_h_128|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|20.1 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/uer/chinese_roberta_L-12_H-128
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://github.com/dbiir/UER-py/
- https://cloud.tencent.com/
---
layout: model
title: Legal Sick leave Clause Binary Classifier
author: John Snow Labs
name: legclf_sick_leave_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `sick-leave` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into account that this model's embeddings allow up to 512 tokens. If your input is longer, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you have added.
## Predicted Entities
`other`, `sick-leave`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_sick_leave_clause_en_1.0.0_3.2_1660124014656.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_sick_leave_clause_en_1.0.0_3.2_1660124014656.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
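This card does not ship a usage snippet; below is a minimal sketch assuming the Legal NLP clause-classification pipeline used in similar cards, with the `johnsnowlabs` library's `nlp`/`legal` modules, a running `spark` session, and Universal Sentence Encoder (`tfhub_use`) producing the `sentence_embeddings` input (an assumption; check the Input/Output Labels above against your setup).

```python
# Sketch only: assumes the `johnsnowlabs` nlp/legal modules and a Spark session.
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Assumption: tfhub_use supplies the sentence_embeddings input
# listed under Input Labels in the Model Information table.
embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_sick_leave_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("category.result").show(truncate=False)
```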
## Results
```bash
+------------+
|result      |
+------------+
|[sick-leave]|
|[other]     |
|[other]     |
|[sick-leave]|
+------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_sick_leave_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|
## References
Legal documents, scraped from the Internet and classified in-house
## Benchmarking
```bash
label precision recall f1-score support
other 0.96 0.99 0.98 80
sick-leave 0.97 0.93 0.95 42
accuracy - - 0.97 122
macro-avg 0.97 0.96 0.96 122
weighted-avg 0.97 0.97 0.97 122
```
---
layout: model
title: Marathi Bert Embeddings (from monsoon-nlp)
author: John Snow Labs
name: bert_embeddings_muril_adapted_local
date: 2022-04-11
tags: [bert, embeddings, mr, open_source]
task: Embeddings
language: mr
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `muril-adapted-local` is a Marathi model originally trained by `monsoon-nlp`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_mr_3.4.2_3.0_1649675136793.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_mr_3.4.2_3.0_1649675136793.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","mr") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["मला स्पार्क एनएलपी आवडते"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","mr")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("मला स्पार्क एनएलपी आवडते").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("mr.embed.muril_adapted_local").predict("""मला स्पार्क एनएलपी आवडते""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_muril_adapted_local|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|mr|
|Size:|888.7 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/monsoon-nlp/muril-adapted-local
- https://tfhub.dev/google/MuRIL/1
---
layout: model
title: Detect Assertion Status (assertion_dl) - supports confidence scores
author: John Snow Labs
name: assertion_dl
date: 2021-01-26
task: Assertion Status
language: en
nav_key: models
edition: Healthcare NLP 2.7.2
spark_version: 2.4
tags: [assertion, en, licensed, clinical]
supported: true
annotator: AssertionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Assign assertion status to clinical entities extracted by NER based on their context in the text.
## Predicted Entities
`absent`, `present`, `conditional`, `associated_with_someone_else`, `hypothetical`, `possible`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION/){:.button.button-orange}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_en_2.7.2_2.4_1611647201607.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_en_2.7.2_2.4_1611647201607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
clinical_assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
clinical_assertion
])
data = spark.createDataFrame([["The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes."]]).toDF("text")
model = nlpPipeline.fit(data)
result = model.transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val clinical_assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings"))
.setOutputCol("assertion")
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
clinical_assertion
))
val text = """The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes."""
val data = Seq(text).toDS.toDF("text")
val results = nlpPipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.assert").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_canard","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_canard","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_canard|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|465.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/peggyhuang/roberta-canard
---
layout: model
title: Detect Problems, Tests and Treatments (ner_healthcare)
author: John Snow Labs
name: ner_healthcare_en
date: 2020-03-26
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 2.4.4
spark_version: 2.4
tags: [ner, en, licensed, clinical]
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
{:.h2_title}
## Description
Pretrained named entity recognition deep learning model for healthcare. Includes Problem, Test and Treatment entities. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
{:.h2_title}
## Predicted Entities
``PROBLEM``, ``TEST``, ``TREATMENT``.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CLINICAL/){:.button.button-orange}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_en_2.4.4_2.4_1585188313964.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_en_2.4.4_2.4_1585188313964.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
{:.h2_title}
## How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter at the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %}
```python
...
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_healthcare", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
data = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG ."]]).toDF("text")
model = nlpPipeline.fit(data)
results = model.transform(data)
```
```scala
...
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = NerDLModel.pretrained("ner_healthcare", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))
val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG .").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.h2_title}
## Results
The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe or add the ``"Finisher"`` to the end of your pipeline.
```bash
| | chunk | ner_label |
|---|-------------------------------|-----------|
| 0 | a respiratory tract infection | PROBLEM |
| 1 | metformin | TREATMENT |
| 2 | glipizide | TREATMENT |
| 3 | dapagliflozin | TREATMENT |
| 4 | T2DM | PROBLEM |
| 5 | atorvastatin | TREATMENT |
| 6 | gemfibrozil | TREATMENT |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_healthcare_en_2.4.4_2.4|
|Type:|ner|
|Compatibility:|Spark NLP 2.4.4|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Case sensitive:|false|
{:.h2_title}
## Data Source
Trained on 2010 i2b2 challenge data with 'embeddings_healthcare_100d'.
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
{:.h2_title}
## Benchmarking
```bash
| | label | tp | fp | fn | prec | rec | f1 |
|---:|:--------------|------:|------:|------:|---------:|---------:|---------:|
| 0 | I-TREATMENT | 6625 | 1187 | 1329 | 0.848054 | 0.832914 | 0.840416 |
| 1 | I-PROBLEM | 15142 | 1976 | 2542 | 0.884566 | 0.856254 | 0.87018 |
| 2 | B-PROBLEM | 11005 | 1065 | 1587 | 0.911765 | 0.873968 | 0.892466 |
| 3 | I-TEST | 6748 | 923 | 1264 | 0.879677 | 0.842237 | 0.86055 |
| 4 | B-TEST | 8196 | 942 | 1029 | 0.896914 | 0.888455 | 0.892665 |
| 5 | B-TREATMENT | 8271 | 1265 | 1073 | 0.867345 | 0.885167 | 0.876165 |
| 6 | Macro-average | 55987 | 7358 | 8824 | 0.881387 | 0.863166 | 0.872181 |
| 7 | Micro-average | 55987 | 7358 | 8824 | 0.883842 | 0.86385 | 0.873732 |
```
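The micro-average row pools the tp/fp/fn counts across all labels; as a quick sanity check, the reported precision, recall, and F1 can be recomputed from those pooled counts (plain Python, numbers taken from the table above):

```python
# Pooled counts from the micro-average row of the benchmark table.
tp, fp, fn = 55987, 7358, 8824

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Matches the reported 0.883842 / 0.86385 / 0.873732 to table precision.
print(round(precision, 6), round(recall, 6), round(f1, 6))
```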
---
layout: model
title: Pipeline to Detect Anatomical Regions
author: John Snow Labs
name: bert_token_classifier_ner_anatomy_pipeline
date: 2022-03-21
tags: [licensed, ner, anatomy, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [bert_token_classifier_ner_anatomy](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_anatomy_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatomy_pipeline_en_3.4.1_3.0_1647857125493.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatomy_pipeline_en_3.4.1_3.0_1647857125493.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
anatomy_pipeline = PretrainedPipeline("bert_token_classifier_ner_anatomy_pipeline", "en", "clinical/models")
anatomy_pipeline.annotate("This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.")
```
```scala
val anatomy_pipeline = new PretrainedPipeline("bert_token_classifier_ner_anatomy_pipeline", "en", "clinical/models")
anatomy_pipeline.annotate("This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.anatomy_pipeline").predict("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""")
```
## Results
```bash
+-------------------+----------------------+
|chunk |ner_label |
+-------------------+----------------------+
|great toe |Multi-tissue_structure|
|skin |Organ |
|conjunctivae |Multi-tissue_structure|
|Extraocular muscles|Multi-tissue_structure|
|Nares |Multi-tissue_structure|
|turbinates |Multi-tissue_structure|
|Oropharynx |Multi-tissue_structure|
|Mucous membranes |Tissue |
|Neck |Organism_subdivision |
|bowel |Organ |
|great toe |Multi-tissue_structure|
|skin |Organ |
|toenails |Organism_subdivision |
|foot |Organism_subdivision |
|great toe |Multi-tissue_structure|
|toenails |Organism_subdivision |
+-------------------+----------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_anatomy_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.8 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverter
---
layout: model
title: English RobertaForQuestionAnswering (from twmkn9)
author: John Snow Labs
name: roberta_qa_distilroberta_base_squad2
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base-squad2` is an English model originally trained by `twmkn9`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_distilroberta_base_squad2_en_4.0.0_3.0_1655728339007.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_distilroberta_base_squad2_en_4.0.0_3.0_1655728339007.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_distilroberta_base_squad2","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = RoBertaForQuestionAnswering
.pretrained("roberta_qa_distilroberta_base_squad2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.roberta.distilled_base.by_twmkn9").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_distilroberta_base_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|307.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/twmkn9/distilroberta-base-squad2
---
layout: model
title: Legal Duties Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_duties_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, duties, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a binary classifier (True, False) for the `Duties` clause type. To use it, make sure you provide enough context as input. Adding a sentence splitter to the pipeline would make the model see only sentences, not the whole text, so it is better to skip it unless you want to do binary classification at sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
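The first option, paragraph splitting by multiline, can be sketched in a few lines of plain Python (a naive blank-line split; the workshop tutorial linked above covers more robust approaches):

```python
import re

def split_provisions(text: str) -> list:
    """Split a document into candidate provisions on blank lines."""
    parts = re.split(r"\n\s*\n", text)  # one or more blank lines separate paragraphs
    return [p.strip() for p in parts if p.strip()]

doc = "DUTIES.\nThe Employee shall devote his full time to the Company.\n\nSEVERABILITY.\nIf any provision is held invalid..."
provisions = split_provisions(doc)
print(len(provisions))  # 2
```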
Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you add.
## Predicted Entities
`Duties`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_duties_bert_en_1.0.0_3.0_1678050512891.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_duties_bert_en_1.0.0_3.0_1678050512891.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
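This card ships without example code. The sketch below follows the pattern used by other Legal NLP binary clause classifiers and is an assumption rather than a verified snippet; in particular, the `sent_bert_base_cased` embeddings stage and the `nlp`/`legal` namespaces are placeholders to check against your installation.

```python
# Sketch only: assumed standard Legal NLP clause-classification pipeline.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Placeholder embeddings model; the classifier expects `sentence_embeddings` input.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_duties_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])

df = spark.createDataFrame([["YOUR CLAUSE TEXT HERE"]]).toDF("text")
result = pipeline.fit(df).transform(df)
```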
## Results
```bash
+--------+
|result  |
+--------+
|[Duties]|
|[Other] |
|[Other] |
|[Duties]|
+--------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_duties_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Duties 0.92 0.92 0.92 39
Other 0.95 0.95 0.95 61
accuracy - - 0.94 100
macro-avg 0.94 0.94 0.94 100
weighted-avg 0.94 0.94 0.94 100
```
---
layout: model
title: Relation extraction between body parts and direction entities (ReDL).
author: John Snow Labs
name: redl_bodypart_direction_biobert
date: 2021-02-04
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 2.7.3
spark_version: 2.4
tags: [licensed, clinical, en, relation_extraction]
supported: true
annotator: RelationExtractionDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Relation extraction between body part entities (e.g. `Internal_organ_or_component`, `External_body_part_or_region`) and direction entities (e.g. `upper`, `lower`) in clinical texts. `1`: the body part and direction entities are related; `0`: they are not related.
## Predicted Entities
`0`, `1`
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_direction_biobert_en_2.7.3_2.4_1612447753332.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_direction_biobert_en_2.7.3_2.4_1612447753332.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = sparknlp.annotators.Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
words_embedder = WordEmbeddingsModel() \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"]) \
.setOutputCol("embeddings")
ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_converter = NerConverter() \
.setInputCols(["sentences", "tokens", "ner_tags"]) \
.setOutputCol("ner_chunks")
dependency_parser = DependencyParserModel() \
.pretrained("dependency_conllu", "en") \
.setInputCols(["sentences", "pos_tags", "tokens"]) \
.setOutputCol("dependencies")
# Set a filter on pairs of named entities which will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter() \
.setInputCols(["ner_chunks", "dependencies"])\
.setMaxSyntacticDistance(10)\
.setOutputCol("re_ner_chunks")\
.setRelationPairs(['direction-external_body_part_or_region',
'external_body_part_or_region-direction',
'direction-internal_organ_or_component',
'internal_organ_or_component-direction'
])
# This model was trained on sentence-level relations.
# It can also be trained on document-level relations; in that case, use "document" instead of "sentence" as input at prediction time.
re_model = RelationExtractionDLModel()\
.pretrained('redl_bodypart_direction_biobert', 'en', "clinical/models") \
.setPredictionThreshold(0.5)\
.setInputCols(["re_ner_chunks", "sentences"]) \
.setOutputCol("relations")
pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model])
text = "MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia"
data = spark.createDataFrame([[text]]).toDF("text")
p_model = pipeline.fit(data)
result = p_model.transform(data)
```
```scala
...
val documenter = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = sparknlp.annotators.Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val pos_tagger = PerceptronModel()
.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val words_embedder = WordEmbeddingsModel()
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_converter = NerConverter()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel()
.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
// Set a filter on pairs of named entities which will be treated as relation candidates
val re_ner_chunk_filter = RENerChunksFilter()
.setInputCols(Array("ner_chunks", "dependencies"))
.setMaxSyntacticDistance(10)
.setOutputCol("re_ner_chunks")
.setRelationPairs(Array("direction-external_body_part_or_region",
"external_body_part_or_region-direction",
"direction-internal_organ_or_component",
"internal_organ_or_component-direction"))
// This model was trained on sentence-level relations.
// It can also be trained on document-level relations; in that case, use "document" instead of "sentence" as input at prediction time.
val re_model = RelationExtractionDLModel()
.pretrained("redl_bodypart_direction_biobert", "en", "clinical/models")
.setPredictionThreshold(0.5)
.setInputCols(Array("re_ner_chunks", "sentences"))
.setOutputCol("relations")
val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model))
val data = Seq("MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation").predict("""MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia""")
```
## Results
```bash
| index | relations | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|-------|-----------|-----------------------------|---------------|-------------|------------|-----------------------------|-------------|-------------|---------------|------------|
| 0 | 1 | Direction | 35 | 39 | upper | Internal_organ_or_component | 41 | 50 | brain stem | 0.9999989 |
| 1 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 59 | 68 | cerebellum | 0.99992585 |
| 2 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.9999999 |
| 3 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 54 | 57 | left | 0.999811 |
| 4 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 75 | 79 | right | 0.9998203 |
| 5 | 1 | Direction | 54 | 57 | left | Internal_organ_or_component | 59 | 68 | cerebellum | 1.0 |
| 6 | 0 | Direction | 54 | 57 | left | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.97616416 |
| 7 | 0 | Internal_organ_or_component | 59 | 68 | cerebellum | Direction | 75 | 79 | right | 0.953046 |
| 8 | 1 | Direction | 75 | 79 | right | Internal_organ_or_component | 81 | 93 | basil ganglia | 1.0 |
```
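Downstream, usually only the pairs predicted as related (`relations = 1`) matter. A quick plain-Python filter over the rows above (hard-coded here for illustration) pairs each direction with its related body part:

```python
# (relation, chunk1, chunk2) triples lifted from the results table above.
rows = [
    ("1", "upper", "brain stem"),
    ("0", "upper", "cerebellum"),
    ("0", "upper", "basil ganglia"),
    ("0", "brain stem", "left"),
    ("0", "brain stem", "right"),
    ("1", "left", "cerebellum"),
    ("0", "left", "basil ganglia"),
    ("0", "cerebellum", "right"),
    ("1", "right", "basil ganglia"),
]

related = [(c1, c2) for rel, c1, c2 in rows if rel == "1"]
print(related)  # [('upper', 'brain stem'), ('left', 'cerebellum'), ('right', 'basil ganglia')]
```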
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|redl_bodypart_direction_biobert|
|Compatibility:|Healthcare NLP 2.7.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
## Data Source
Trained on an internal dataset.
## Benchmarking
```bash
Relation Recall Precision F1 Support
0 0.856 0.873 0.865 153
1 0.986 0.984 0.985 1347
Avg. 0.921 0.929 0.925
```
---
layout: model
title: Pipeline to Extract Entities in Spanish Clinical Trial Abstracts (BertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_ner_clinical_trials_abstracts_pipeline
date: 2023-03-20
tags: [es, clinical, licensed, token_classification, bert, ner]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [bert_token_classifier_ner_clinical_trials_abstracts](https://nlp.johnsnowlabs.com/2022/08/11/bert_token_classifier_ner_clinical_trials_abstracts_es_3_0.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_trials_abstracts_pipeline_es_4.3.0_3.2_1679298645358.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_trials_abstracts_pipeline_es_4.3.0_3.2_1679298645358.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("bert_token_classifier_ner_clinical_trials_abstracts_pipeline", "es", "clinical/models")
text = '''Efecto de la suplementación con ácido fólico sobre los niveles de homocisteína total en pacientes en hemodiálisis. La hiperhomocisteinemia es un marcador de riesgo independiente de morbimortalidad cardiovascular. Hemos prospectivamente reducir los niveles de homocisteína total (tHcy) mediante suplemento con ácido fólico y vitamina B6 (pp), valorando su posible correlación con dosis de diálisis, función residual y parámetros nutricionales.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("bert_token_classifier_ner_clinical_trials_abstracts_pipeline", "es", "clinical/models")
val text = "Efecto de la suplementación con ácido fólico sobre los niveles de homocisteína total en pacientes en hemodiálisis. La hiperhomocisteinemia es un marcador de riesgo independiente de morbimortalidad cardiovascular. Hemos prospectivamente reducir los niveles de homocisteína total (tHcy) mediante suplemento con ácido fólico y vitamina B6 (pp), valorando su posible correlación con dosis de diálisis, función residual y parámetros nutricionales."
val result = pipeline.fullAnnotate(text)
```
## Results
```bash
| | ner_chunk | begin | end | ner_label | confidence |
|---:|:------------------------|--------:|------:|:------------|-------------:|
| 0 | suplementación | 13 | 26 | PROC | 0.999993 |
| 1 | ácido fólico | 32 | 43 | CHEM | 0.999753 |
| 2 | niveles de homocisteína | 55 | 77 | PROC | 0.997803 |
| 3 | hemodiálisis | 101 | 112 | PROC | 0.999993 |
| 4 | hiperhomocisteinemia | 118 | 137 | DISO | 0.999995 |
| 5 | niveles de homocisteína | 248 | 270 | PROC | 0.999988 |
| 6 | tHcy | 279 | 282 | PROC | 0.999989 |
| 7 | ácido fólico | 309 | 320 | CHEM | 0.999987 |
| 8 | vitamina B6 | 324 | 334 | CHEM | 0.999967 |
| 9 | pp | 337 | 338 | CHEM | 0.999889 |
| 10 | diálisis | 388 | 395 | PROC | 0.999993 |
| 11 | función residual | 398 | 414 | PROC | 0.999948 |
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_clinical_trials_abstracts_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|es|
|Size:|410.6 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverterInternalModel
---
layout: model
title: Mapping Drugs With Their Corresponding Actions And Treatments
author: John Snow Labs
name: drug_action_treatment_mapper
date: 2022-04-04
tags: [en, chunkmapping, chunkmapper, drug, action, treatment, licensed]
task: Chunk Mapping
language: en
nav_key: models
edition: Healthcare NLP 3.4.2
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained model maps drugs with their corresponding `action` and `treatment`: `action` refers to the function of the drug in various body systems, while `treatment` refers to the diseases the drug is used to treat.
## Predicted Entities
`action`, `treatment`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/drug_action_treatment_mapper_en_3.4.2_3.0_1649098201229.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/drug_action_treatment_mapper_en_3.4.2_3.0_1649098201229.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
ner = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models")\
.setInputCols(["token","sentence"])\
.setOutputCol("ner")
nerconverter = NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("drug")
chunkerMapper = ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models") \
.setInputCols("drug")\
.setOutputCol("relations")\
.setRel("treatment") # or "action"
pipeline = Pipeline().setStages([document_assembler,
sentence_detector,
tokenizer,
ner,
nerconverter,
chunkerMapper])
text = ["""The patient is a 71-year-old female patient of Dr. X. and she was given Aklis and Dermovate.
Current Medications: Diprivan, Proventil """]
data = spark.createDataFrame([text]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val ner = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models")
.setInputCols(Array("token","sentence"))
.setOutputCol("ner")
val nerconverter = NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("drug")
val chunkerMapper = ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models")
.setInputCols("drug")
.setOutputCol("relations")
.setRel("treatment")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, ner, nerconverter, chunkerMapper ))
val text_data = Seq("""The patient is a 71-year-old female patient of Dr. X. and she was given Aklis and Dermovate.
Current Medications: Diprivan, Proventil""").toDF("text")
val res = pipeline.fit(text_data).transform(text_data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.drug_to_action_treatment").predict("""
The patient is a 71-year-old female patient of Dr. X. and she was given Aklis and Dermovate.
Current Medications: Diprivan, Proventil
""")
```
## Results
```bash
+---------+------------------+--------------------------------------------------------------+
|Drug |Treats |Pharmaceutical Action |
+---------+------------------+--------------------------------------------------------------+
|Aklis |Hyperlipidemia |Hypertension:::Diabetic Kidney Disease:::Cerebrovascular... |
|Dermovate|Lupus |Discoid Lupus Erythematosus:::Empeines:::Psoriasis:::Eczema...|
|Diprivan |Infection |Laryngitis:::Pneumonia:::Pharyngitis |
|Proventil|Addison's Disease |Allergic Conjunctivitis:::Anemia:::Ankylosing Spondylitis |
+---------+------------------+--------------------------------------------------------------+
```
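The mapper packs multiple relations into a single string joined by `:::`, as the columns above show. A small helper (hypothetical, not part of Spark NLP) to split such a string back into a Python list:

```python
def split_relations(mapping: str, sep: str = ":::") -> list[str]:
    """Split a ':::'-joined ChunkMapper result string into individual relations."""
    return [item.strip() for item in mapping.split(sep) if item.strip()]

# Example string shaped like a cell of the Results table above
relations = split_relations("Discoid Lupus Erythematosus:::Psoriasis:::Eczema")
print(relations)  # ['Discoid Lupus Erythematosus', 'Psoriasis', 'Eczema']
```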
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|drug_action_treatment_mapper|
|Compatibility:|Healthcare NLP 3.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|8.7 MB|
---
layout: model
title: Translate English to Afrikaans Pipeline
author: John Snow Labs
name: translate_en_af
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, af, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `af`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_af_xx_2.7.0_2.4_1609689571052.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_af_xx_2.7.0_2.4_1609689571052.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_af", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_af", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.af').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_af|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForQuestionAnswering model (from batterydata)
author: John Snow Labs
name: bert_qa_batteryonlybert_uncased_squad_v1
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `batteryonlybert-uncased-squad-v1` is an English model originally trained by `batterydata`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_batteryonlybert_uncased_squad_v1_en_4.0.0_3.0_1654179355858.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_batteryonlybert_uncased_squad_v1_en_4.0.0_3.0_1654179355858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_batteryonlybert_uncased_squad_v1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)
pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")
val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_batteryonlybert_uncased_squad_v1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document, spanClassifier))
val example = Seq(
("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")
val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad_battery.bert.uncased_only_bert.by_batterydata").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_qa_batteryonlybert_uncased_squad_v1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|408.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/batterydata/batteryonlybert-uncased-squad-v1
- https://github.com/ShuHuang/batterybert
---
layout: model
title: Fast Neural Machine Translation Model from Xhosa to English
author: John Snow Labs
name: opus_mt_xh_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, xh, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `xh`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_xh_en_xx_2.7.0_2.4_1609164176872.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_xh_en_xx_2.7.0_2.4_1609164176872.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_xh_en", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your Xhosa text goes here")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_xh_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your Xhosa text goes here").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.xh.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_xh_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering model (from holtin) Squad
author: John Snow Labs
name: distilbert_qa_holtin_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `holtin`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_holtin_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725474082.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_holtin_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725474082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_holtin_base_uncased_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_holtin_base_uncased_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased_v2.by_holtin").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_holtin_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|
## References
- https://huggingface.co/holtin/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Legal Consents Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_consents_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, consents, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
The LEDGAR dataset targets contract-provision (paragraph) classification. The provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
This model is a Binary Classifier (True, False) for the `Consents` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline would make the model see only individual sentences rather than the whole text, so it is better to skip it unless you want binary classification at sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting them using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
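As a minimal illustration of the first technique (paragraph splitting by multiline breaks; this is a sketch, not the workshop's exact implementation):

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Split a document into paragraphs on blank lines (multiline breaks)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause 1. Consents...\n\nClause 2. Governing Law..."
print(split_paragraphs(doc))  # ['Clause 1. Consents...', 'Clause 2. Governing Law...']
```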
Keep in mind that this model's embeddings allow up to 512 tokens; if your input is longer, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers in Models Hub to obtain a series of True/False values, one for each clause model you add.
## Predicted Entities
`Consents`, `Other`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_consents_bert_en_1.0.0_3.0_1678050585575.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_consents_bert_en_1.0.0_3.0_1678050585575.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
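The usage snippet is missing from this card. Below is a minimal sketch assembled from sibling Legal NLP classifier cards, assuming the usual `johnsnowlabs` `nlp`/`legal` namespaces and a generic sentence-embeddings stage; `sent_bert_base_cased` and the `category` output column are assumptions, not confirmed by this card. A licensed Legal NLP environment is required.

```python
# Hypothetical pipeline sketch for legclf_consents_bert (licensed models required).
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Assumed embeddings model; this card only states the input is sentence_embeddings.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_consents_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlp_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, doc_classifier])
df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
result = nlp_pipeline.fit(df).transform(df)
```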
## Results
```bash
+-------+
|result|
+-------+
|[Consents]|
|[Other]|
|[Other]|
|[Consents]|
+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|legclf_consents_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|
## References
Train dataset available [here](https://huggingface.co/datasets/lex_glue)
## Benchmarking
```bash
label precision recall f1-score support
Consents 0.81 0.94 0.87 49
Other 0.95 0.85 0.90 72
accuracy - - 0.88 121
macro-avg 0.88 0.89 0.88 121
weighted-avg 0.89 0.88 0.89 121
```
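The macro and weighted averages above follow from the per-label rows; a quick sanity check (the weighted f1 uses each label's support as its weight):

```python
labels = {  # label: (precision, recall, f1, support), from the benchmarking table
    "Consents": (0.81, 0.94, 0.87, 49),
    "Other":    (0.95, 0.85, 0.90, 72),
}
total = sum(s for *_, s in labels.values())                       # 121
macro_f1 = sum(f1 for _, _, f1, _ in labels.values()) / len(labels)
weighted_f1 = sum(f1 * s for _, _, f1, s in labels.values()) / total

print(round(macro_f1, 3), round(weighted_f1, 3))  # 0.885 0.888
```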
---
layout: model
title: English Named Entity Recognition (from dslim)
author: John Snow Labs
name: bert_ner_bert_base_NER
date: 2022-05-09
tags: [bert, ner, token_classification, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-NER` is an English model originally trained by `dslim`.
## Predicted Entities
`LOC`, `PER`, `ORG`, `MISC`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_NER_en_3.4.2_3.0_1652096558277.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_NER_en_3.4.2_3.0_1652096558277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_NER","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_NER","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_bert_base_NER|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/dslim/bert-base-NER
- https://www.aclweb.org/anthology/W03-0419.pdf
- https://www.aclweb.org/anthology/W03-0419.pdf
- https://arxiv.org/pdf/1810.04805
- https://github.com/google-research/bert/issues/223
---
layout: model
title: English image_classifier_vit_electric_2 ViTForImageClassification from smc
author: John Snow Labs
name: image_classifier_vit_electric_2
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_electric_2` is an English model originally trained by smc.
## Predicted Entities
`poles`, `transformers`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_electric_2_en_4.1.0_3.0_1660171547079.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_electric_2_en_4.1.0_3.0_1660171547079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_electric_2", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_electric_2", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_electric_2|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Urdu BertForMaskedLM Base Cased model (from Geotrend)
author: John Snow Labs
name: bert_embeddings_base_ur_cased
date: 2022-12-02
tags: [ur, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: ur
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-ur-cased` is an Urdu model originally trained by `Geotrend`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_ur_cased_ur_4.2.4_3.0_1670019274115.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_ur_cased_ur_4.2.4_3.0_1670019274115.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_ur_cased","ur") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])
data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_ur_cased","ur")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))
val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_ur_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ur|
|Size:|348.0 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/Geotrend/bert-base-ur-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers
---
layout: model
title: English T5ForConditionalGeneration Large Cased model (from google)
author: John Snow Labs
name: t5_efficient_large_nl12
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-large-nl12` is an English model originally trained by `google`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nl12_en_4.3.0_3.0_1675117150157.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nl12_en_4.3.0_3.0_1675117150157.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_efficient_large_nl12","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_efficient_large_nl12","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_efficient_large_nl12|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|802.0 MB|
## References
- https://huggingface.co/google/t5-efficient-large-nl12
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Fast Neural Machine Translation Model from English to Hungarian
author: John Snow Labs
name: opus_mt_en_hu
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, hu, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `en`
- target languages: `hu`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_hu_xx_2.7.0_2.4_1609170999360.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_hu_xx_2.7.0_2.4_1609170999360.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentences")
marian = MarianTransformer.pretrained("opus_mt_en_hu", "xx")\
.setInputCols(["sentences"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your English text goes here")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_en_hu", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.hu').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_en_hu|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Translate English to Pohnpeian Pipeline
author: John Snow Labs
name: translate_en_pon
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, pon, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `pon`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_pon_xx_2.7.0_2.4_1609690686455.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_pon_xx_2.7.0_2.4_1609690686455.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_pon", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_pon", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
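For a single string, `annotate` returns a plain dictionary mapping each output column to a list of strings. A small sketch of reading the translation out of it (the values here are placeholders, not real pipeline output):

```python
# Mock of the annotate() return shape for a translation pipeline.
mock_annotations = {
    "document": ["Your sentence to translate!"],
    "sentence": ["Your sentence to translate!"],
    "translation": ["<translated sentence>"],
}

def first_translation(annotations):
    """Return the first translated sentence, or None if nothing came back."""
    translations = annotations.get("translation", [])
    return translations[0] if translations else None

print(first_translation(mock_annotations))
```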
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.pon').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_pon|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Fast Neural Machine Translation Model from Rundi to English
author: John Snow Labs
name: opus_mt_run_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, run, en, xx]
supported: true
annotator: MarianTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `run`
- target languages: `en`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_run_en_xx_2.7.0_2.4_1609166179021.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_run_en_xx_2.7.0_2.4_1609166179021.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import SentenceDetectorDLModel, MarianTransformer

documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_run_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_run_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.run.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_run_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC4CHEMD_Chem_Modified_SciBERT_512
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Modified-SciBERT-512` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities
`Chemical`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_SciBERT_512_en_4.0.0_3.0_1657108506933.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_SciBERT_512_en_4.0.0_3.0_1657108506933.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_SciBERT_512","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_SciBERT_512","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
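Token classifiers like this one emit one IOB-style label per token; downstream code usually collapses `B-`/`I-` runs into entity chunks. A minimal, framework-free sketch of that step (the tokens and labels below are illustrative, not real model output):

```python
# Collapse per-token BIO labels into entity chunk strings.
def bio_to_chunks(tokens, labels):
    chunks, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                chunks.append(" ".join(current))
            current = [tok]
        elif lab.startswith("I-") and current:
            current.append(tok)
        else:  # "O" or a dangling I- tag ends any open chunk
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ["The", "aspirin", "dose", "and", "vitamin", "C"]
labels = ["O", "B-Chemical", "O", "O", "B-Chemical", "I-Chemical"]
print(bio_to_chunks(tokens, labels))  # ['aspirin', 'vitamin C']
```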
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_ner_BC4CHEMD_Chem_Modified_SciBERT_512|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|410.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
- https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Modified-SciBERT-512
---
layout: model
title: Financial Sentiment Analysis (Lithuanian)
author: John Snow Labs
name: finclf_bert_sentiment_analysis
date: 2022-10-22
tags: [lt, legal, classification, sentiment, analysis, licensed]
task: Text Classification
language: lt
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: FinanceBertForSequenceClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This is a Lithuanian Sentiment Analysis text classifier that identifies whether a text expresses a positive or a negative emotion.
## Predicted Entities
`POS`, `NEG`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_bert_sentiment_analysis_lt_1.0.0_3.0_1666475378253.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_bert_sentiment_analysis_lt_1.0.0_3.0_1666475378253.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
# Test classifier in Spark NLP pipeline
document_assembler = nlp.DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')
tokenizer = nlp.Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
# Load newly trained classifier
sequenceClassifier_loaded = finance.BertForSequenceClassification.pretrained("finclf_bert_sentiment_analysis", "lt", "finance/models")\
.setInputCols(["document",'token'])\
.setOutputCol("class")
pipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier_loaded
])
# Generating example
example = spark.createDataFrame([["Pagalbos paraðiuto laukiantis verslas priemones vertina teigiamai tik yra keli „jeigu“"]]).toDF("text")
result = pipeline.fit(example).transform(example)
# Checking results
result.select("text", "class.result").show(truncate=False)
```
## Results
```bash
+---------------------------------------------------------------------------------------+------+
|text |result|
+---------------------------------------------------------------------------------------+------+
|Pagalbos paraðiuto laukiantis verslas priemones vertina teigiamai tik yra keli „jeigu“|[POS] |
+---------------------------------------------------------------------------------------+------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finclf_bert_sentiment_analysis|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|lt|
|Size:|406.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|
## References
An in-house augmented version of [this dataset](https://www.kaggle.com/datasets/rokastrimaitis/lithuanian-financial-news-dataset-and-bigrams?select=dataset%28original%29.csv) removing NEU tag
## Benchmarking
```bash
label precision recall f1-score support
NEG 0.80 0.76 0.78 509
POS 0.90 0.92 0.91 1167
accuracy - - 0.87 1676
macro-avg 0.85 0.84 0.84 1676
weighted-avg 0.87 0.87 0.87 1676
```
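The macro and weighted averages in the table follow directly from the per-class rows; a quick sanity check in plain Python:

```python
# Recompute the macro and support-weighted F1 from the per-class rows
# of the benchmarking table above: label -> (f1, support).
scores = {"NEG": (0.78, 509), "POS": (0.91, 1167)}

total_support = sum(support for _, support in scores.values())
weighted_f1 = sum(f1 * support for f1, support in scores.values()) / total_support
macro_f1 = sum(f1 for f1, _ in scores.values()) / len(scores)

print(round(weighted_f1, 2))  # 0.87
```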
---
layout: model
title: Pipeline to Classify Texts into TREC-6 Classes
author: John Snow Labs
name: bert_sequence_classifier_trec_coarse_pipeline
date: 2022-02-23
tags: [bert_sequence, trec, coarse, bert, en, open_source]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 3.4.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [bert_sequence_classifier_trec_coarse_en](https://nlp.johnsnowlabs.com/2021/11/06/bert_sequence_classifier_trec_coarse_en.html).
The TREC dataset for question classification consists of open-domain, fact-based questions divided into broad semantic categories. You can check the official documentation of the dataset, entities, etc. [here](https://search.r-project.org/CRAN/refmans/textdata/html/dataset_trec.html).
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_trec_coarse_pipeline_en_3.4.0_3.0_1645609565500.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_trec_coarse_pipeline_en_3.4.0_3.0_1645609565500.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
trec_pipeline = PretrainedPipeline("bert_sequence_classifier_trec_coarse_pipeline", lang = "en")
trec_pipeline.annotate("Germany is the largest country in Europe economically.")
```
```scala
val trec_pipeline = new PretrainedPipeline("bert_sequence_classifier_trec_coarse_pipeline", lang = "en")
trec_pipeline.annotate("Germany is the largest country in Europe economically.")
```
## Results
```bash
['LOC']
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_trec_coarse_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.4.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|406.6 MB|
## Included Models
- DocumentAssembler
- TokenizerModel
- BertForSequenceClassification
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_rule_based_only_classfn_twostage_epochs_1_shard_1_squad2.0
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_only_classfn_twostage_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223837089.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_only_classfn_twostage_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223837089.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_only_classfn_twostage_epochs_1_shard_1_squad2.0","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_only_classfn_twostage_epochs_1_shard_1_squad2.0","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
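Under the hood, an extractive question-answering head scores each token as a candidate answer start and end, and the answer is the best-scoring span with `end >= start`. A toy, framework-free sketch of that span selection (the logits below are illustrative, not real model output):

```python
# Pick the (start, end) token pair maximizing start + end logit,
# subject to end >= start and a maximum answer length.
def best_span(start_logits, end_logits, max_answer_len=15):
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            if s_score + end_logits[e] > best_score:
                best_score = s_score + end_logits[e]
                best = (s, e)
    return best

tokens = ["My", "name", "is", "Clara"]
start_logits = [0.1, 0.2, 0.1, 2.5]
end_logits = [0.0, 0.1, 0.2, 3.0]
s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # Clara
```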
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_rule_based_only_classfn_twostage_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|463.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/AnonymousSub/rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0
---
layout: model
title: Pipeline to Detect PHI for Deidentification (Generic)
author: John Snow Labs
name: ner_deid_generic_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, deid, de]
task: Named Entity Recognition
language: de
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on the top of [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/01/06/ner_deid_generic_de.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_pipeline_de_3.4.1_3.0_1647888023955.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_pipeline_de_3.4.1_3.0_1647888023955.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_deid_generic_pipeline", "de", "clinical/models")
pipeline.annotate("Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.")
```
```scala
val pipeline = new PretrainedPipeline("ner_deid_generic_pipeline", "de", "clinical/models")
pipeline.annotate("Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.")
```
{:.nlu-block}
```python
import nlu
nlu.load("de.med_ner.deid_generic.pipeline").predict("""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""")
```
## Results
```bash
+------------------------------------------+---------+
|chunk                                     |ner_label|
+------------------------------------------+---------+
|Michael Berger                            |NAME     |
|12 Dezember 2018                          |DATE     |
|St. Elisabeth-Krankenhaus in Bad Kissingen|LOCATION |
|Berger                                    |NAME     |
|76                                        |AGE      |
+------------------------------------------+---------+
```
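Deidentification pipelines typically follow NER with a step that replaces each detected chunk by its label. A minimal sketch of that masking step over the chunks shown in the results above (plain Python, standing in for the dedicated deidentification annotators):

```python
# Replace each detected PHI chunk with an angle-bracketed label placeholder.
text = ("Michael Berger wird am Morgen des 12 Dezember 2018 ins "
        "St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert.")
chunks = [("Michael Berger", "NAME"), ("12 Dezember 2018", "DATE")]

def mask(text, chunks):
    """Substitute every (chunk, label) pair with <LABEL> in the text."""
    for chunk, label in chunks:
        text = text.replace(chunk, f"<{label}>")
    return text

print(mask(text, chunks))
```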
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_deid_generic_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|de|
|Size:|1.3 GB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
---
layout: model
title: Fast Neural Machine Translation Model from Arabic to Spanish
author: John Snow Labs
name: opus_mt_ar_es
date: 2021-06-01
tags: [open_source, seq2seq, translation, ar, es, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `ar`
- target languages: `es`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ar_es_xx_3.1.0_2.4_1622550859700.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ar_es_xx_3.1.0_2.4_1622550859700.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
marian = MarianTransformer.pretrained("opus_mt_ar_es", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")
val marian = MarianTransformer.pretrained("opus_mt_ar_es", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Arabic.translate_to.Spanish').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|opus_mt_ar_es|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English DistilBertForQuestionAnswering Base Cased model (from nlpunibo)
author: John Snow Labs
name: distilbert_qa_base_config2
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert_base_config2` is an English model originally trained by `nlpunibo`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config2_en_4.3.0_3.0_1672774448417.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config2_en_4.3.0_3.0_1672774448417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config2","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config2","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_config2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## References
- https://huggingface.co/nlpunibo/distilbert_base_config2
---
layout: model
title: Telugu Bert Embeddings
author: John Snow Labs
name: bert_embeddings_telugu_bertu
date: 2022-04-11
tags: [bert, embeddings, te, open_source]
task: Embeddings
language: te
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `telugu_bertu` is a Telugu model originally trained by `kuppuluri`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_telugu_bertu_te_3.4.2_3.0_1649675264476.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_telugu_bertu_te_3.4.2_3.0_1649675264476.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained("bert_embeddings_telugu_bertu","te") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_telugu_bertu","te")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను").toDF("text")
val result = pipeline.fit(data).transform(data)
```
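The `embeddings` output column holds one fixed-length float vector per token (typically 768 dimensions for a BERT-base model such as this one). Such vectors are usually compared with cosine similarity; a dependency-free sketch over short illustrative vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical directions score 1.0; orthogonal directions score 0.0.
print(cosine([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]))  # 1.0
```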
{:.nlu-block}
```python
import nlu
nlu.load("te.embed.telugu_bertu").predict("""నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను""")
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_embeddings_telugu_bertu|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|te|
|Size:|415.4 MB|
|Case sensitive:|true|
## References
- https://huggingface.co/kuppuluri/telugu_bertu
---
layout: model
title: Translate English to Russian Pipeline
author: John Snow Labs
name: translate_en_ru
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, ru, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `ru`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ru_xx_2.7.0_2.4_1609687763987.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ru_xx_2.7.0_2.4_1609687763987.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_ru", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_ru", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.ru').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_ru|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Translate English to Lushai Pipeline
author: John Snow Labs
name: translate_en_lus
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, lus, xx]
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en`
- target languages: `lus`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_lus_xx_2.7.0_2.4_1609690402942.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_lus_xx_2.7.0_2.4_1609690402942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("translate_en_lus", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("translate_en_lus", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.lus').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|translate_en_lus|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|
## Data Source
[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from mrm8488)
author: John Snow Labs
name: t5_base_finetuned_span_sentiment_extraction
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-finetuned-span-sentiment-extraction` is an English model originally trained by `mrm8488`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_span_sentiment_extraction_en_4.3.0_3.0_1675109003319.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_span_sentiment_extraction_en_4.3.0_3.0_1675109003319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_base_finetuned_span_sentiment_extraction","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_base_finetuned_span_sentiment_extraction","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_base_finetuned_span_sentiment_extraction|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|905.9 MB|
## References
- https://huggingface.co/mrm8488/t5-base-finetuned-span-sentiment-extraction
- https://twitter.com/AND__SO
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://www.kaggle.com/c/tweet-sentiment-extraction
- https://arxiv.org/pdf/1910.10683.pdf
- https://www.kaggle.com/c/tweet-sentiment-extraction
- https://github.com/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb
- https://github.com/enzoampil
- https://twitter.com/mrm8488
- https://www.linkedin.com/in/manuel-romero-cs/
---
layout: model
title: English image_classifier_vit_Infrastructures ViTForImageClassification from drab
author: John Snow Labs
name: image_classifier_vit_Infrastructures
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_Infrastructures` is an English model originally trained by drab.
## Predicted Entities
`Cooling tower`, `Transmission grid`, `Wind turbines`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Infrastructures_en_4.1.0_3.0_1660167727547.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Infrastructures_en_4.1.0_3.0_1660167727547.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_Infrastructures", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_Infrastructures", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_Infrastructures|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: Kinyarwanda XLMRobertaForTokenClassification Base Cased model (from mbeukman)
author: John Snow Labs
name: xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_kinyarwand
date: 2022-08-01
tags: [rw, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: rw
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-kinyarwanda-finetuned-ner-kinyarwanda` is a Kinyarwanda model originally trained by `mbeukman`.
## Predicted Entities
`PER`, `DATE`, `ORG`, `LOC`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_kinyarwand_rw_4.1.0_3.0_1659354149074.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_kinyarwand_rw_4.1.0_3.0_1659354149074.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_kinyarwand","rw") \
.setInputCols(["document", "token"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_kinyarwand","rw")
.setInputCols(Array("document", "token"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_kinyarwand|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|rw|
|Size:|1.0 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-kinyarwanda-finetuned-ner-kinyarwanda
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://github.com/masakhane-io/masakhane-ner
---
layout: model
title: Detect Entities Related to Cancer Therapies
author: John Snow Labs
name: ner_oncology_therapy_wip
date: 2022-09-30
tags: [licensed, clinical, oncology, en, ner, treatment]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model extracts entities related to oncology therapies using granular labels, including mentions of treatments, posology information and line of therapy.
Definitions of Predicted Entities:
- `Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment.
- `Chemotherapy`: Mentions of chemotherapy drugs, or unspecific words such as "chemotherapy".
- `Cycle_Count`: The total number of cycles being administered of an oncological therapy (e.g. "5 cycles").
- `Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. "day 5").
- `Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. "third cycle").
- `Dosage`: The quantity prescribed by the physician for an active ingredient.
- `Duration`: Words indicating the duration of a treatment (e.g. "for 2 weeks").
- `Frequency`: Words indicating the frequency of treatment administration (e.g. "daily" or "bid").
- `Hormonal_Therapy`: Mentions of hormonal drugs used to treat cancer, or unspecific words such as "hormonal therapy".
- `Immunotherapy`: Mentions of immunotherapy drugs, or unspecific words such as "immunotherapy".
- `Line_Of_Therapy`: Explicit references to the line of therapy of an oncological therapy (e.g. "first-line treatment").
- `Radiotherapy`: Terms that indicate the use of Radiotherapy.
- `Radiation_Dose`: Dose used in radiotherapy.
- `Response_To_Treatment`: Terms related to clinical progress of the patient related to cancer treatment, including "recurrence", "bad response" or "improvement".
- `Route`: Words indicating the type of administration route (such as "PO" or "transdermal").
- `Targeted_Therapy`: Mentions of targeted therapy drugs, or unspecific words such as "targeted therapy".
- `Unspecific_Therapy`: Terms that indicate a known cancer therapy but that is not specific to any other therapy entity (e.g. "chemoradiotherapy" or "adjuvant therapy").
## Predicted Entities
`Cancer_Surgery`, `Chemotherapy`, `Cycle_Count`, `Cycle_Day`, `Cycle_Number`, `Dosage`, `Duration`, `Frequency`, `Hormonal_Therapy`, `Immunotherapy`, `Line_Of_Therapy`, `Radiotherapy`, `Radiation_Dose`, `Response_To_Treatment`, `Route`, `Targeted_Therapy`, `Unspecific_Therapy`
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_therapy_wip_en_4.0.0_3.0_1664557936894.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_therapy_wip_en_4.0.0_3.0_1664557936894.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_therapy_wip", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_therapy_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
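The `NerConverter` stage above merges the token-level IOB tags produced in the `ner` column into labeled chunks. A toy pure-Python sketch of that grouping logic (the tags and tokens here are illustrative; the real annotator operates on Spark NLP annotations, not plain lists):

```python
# Group IOB tags into (label, chunk_text) pairs, the way NerConverter
# turns token-level predictions into entity chunks.
def iob_to_chunks(tokens, tags):
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = [tag[2:], [tok]]          # start a new chunk
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(tok)              # continue the current chunk
        else:
            if current:
                chunks.append(current)
            current = None                      # "O" tag ends any open chunk
    if current:
        chunks.append(current)
    return [(label, " ".join(toks)) for label, toks in chunks]

tokens = ["radiotherapy", "was", "given", "as", "first", "line", "therapy"]
tags = ["B-Radiotherapy", "O", "O", "O",
        "B-Line_Of_Therapy", "I-Line_Of_Therapy", "I-Line_Of_Therapy"]
print(iob_to_chunks(tokens, tags))
# [('Radiotherapy', 'radiotherapy'), ('Line_Of_Therapy', 'first line therapy')]
```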
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_therapy_wip").predict("""The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
.pretrained("image_classifier_vit_roomclassifier", "en")\
.setInputCols("image_assembler") \
.setOutputCol("class")
pipeline = Pipeline(stages=[
image_assembler,
imageClassifier,
])
pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val imageClassifier = ViTForImageClassification
.pretrained("image_classifier_vit_roomclassifier", "en")
.setInputCols("image_assembler")
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_roomclassifier|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: SDOH Environment Status Classification
author: John Snow Labs
name: bert_sequence_classifier_sdoh_environment_status
date: 2022-12-18
tags: [en, clinical, sdoh, licensed, sequence_classification, environment_status, classifier]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
annotator: MedicalBertForSequenceClassification
engine: tensorflow
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model classifies the patient's environment status (housing situation) mentioned in clinical text. A discharge summary was classified as True for the SDOH Environment if there was any indication of housing, False if the patient was homeless, and None if there was no related passage.
## Predicted Entities
`True`, `False`, `None`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_sdoh_environment_status_en_4.2.2_3.0_1671371837321.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_sdoh_environment_status_en_4.2.2_3.0_1671371837321.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_sdoh_environment_status", "en", "clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("class")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
sample_texts = ["The patient is a 29-year-old female with a history of renal transplant in 2097, who had worsening renal failure for the past several months. Her chief complaints were hypotension and seizure. months prior to admission and had been more hypertensive recently, requiring blood pressure medications. She was noted to have worsening renal function secondary to recent preeclampsia and her blood pressure control was thought to be secondary to renal failure.",
"Mr Known lastname 19017 is a 66 year-old man with a PMHx of stage 4 COPD (FEV1 0.65L;FEV1/FVC 37% predicted in 4-14) on 4L home o2 with numerous hospitalizations for COPD exacerbations and intubation, hypertension, coronary artery disease, GERD who presents with SOB and CP. He is admitted to the ICU for management of dyspnea and hypotension.",
"He was deemed Child's B in 2156-5-17 with ongoing ethanol abuse, admitted to Intensive Care Unit due to acute decompensation of chronic liver disease due to alcoholic hepatitis and Escherichia coli sepsis. after being hit in the head with the a bottle and dropping to the floor in the apartment. They had Trauma work him up including a head computerized tomography scan which was negative. He had abdominal pain for approximately one month with increasing abdominal girth, was noted to be febrile to 100 degrees on presentation and was tachycardiac 130, stable blood pressures. He was noted to have distended abdomen with diffuse tenderness computerized tomography scan of the abdomen which showed ascites and large nodule of the liver, splenomegaly, paraesophageal varices and loops of thickened bowel."]
data = spark.createDataFrame(sample_texts, StringType()).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_sdoh_environment_status", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(document_assembler,
tokenizer,
sequenceClassifier))
val data = Seq("The patient is a 29-year-old female with a history of renal transplant in 2097, who had worsening renal failure for the past several months. Her chief complaints were hypotension and seizure. months prior to admission and had been more hypertensive recently, requiring blood pressure medications. She was noted to have worsening renal function secondary to recent preeclampsia and her blood pressure control was thought to be secondary to renal failure.")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.classify.bert_sequence.sdoh.environment_status").predict("""He was deemed Child's B in 2156-5-17 with ongoing ethanol abuse, admitted to Intensive Care Unit due to acute decompensation of chronic liver disease due to alcoholic hepatitis and Escherichia coli sepsis. after being hit in the head with the a bottle and dropping to the floor in the apartment. They had Trauma work him up including a head computerized tomography scan which was negative. He had abdominal pain for approximately one month with increasing abdominal girth, was noted to be febrile to 100 degrees on presentation and was tachycardiac 130, stable blood pressures. He was noted to have distended abdomen with diffuse tenderness computerized tomography scan of the abdomen which showed ascites and large nodule of the liver, splenomegaly, paraesophageal varices and loops of thickened bowel.""")
```
## Results
```bash
+----------------------------------------------------------------------------------------------------+-------+
| text| result|
+----------------------------------------------------------------------------------------------------+-------+
|The patient is a 29-year-old female with a history of renal transplant in 2097, who had worsening...| [None]|
|Mr Known lastname 19017 is a 66 year-old man with a PMHx of stage 4 COPD (FEV1 0.65L;FEV1/FVC 37%...|[False]|
|He was deemed Child's B in 2156-5-17 with ongoing ethanol abuse, admitted to Intensive Care Unit ...| [True]|
+----------------------------------------------------------------------------------------------------+-------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_sdoh_environment_status|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|410.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|
## Benchmarking
```bash
label precision recall f1-score support
None 0.89 0.78 0.83 277
False 0.86 0.93 0.90 419
True 0.67 1.00 0.80 6
accuracy - - 0.87 702
macro-avg 0.81 0.90 0.84 702
weighted-avg 0.87 0.87 0.87 702
```
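The `macro-avg` and `weighted-avg` rows above can be recomputed from the per-label scores: the macro average weights each label equally, while the weighted average weights by support. A quick sanity-check sketch using the F1 values from the table:

```python
# Per-label F1 and support taken from the benchmark table above.
f1 = {"None": 0.83, "False": 0.90, "True": 0.80}
support = {"None": 277, "False": 419, "True": 6}

total = sum(support.values())                                   # 702
macro_f1 = sum(f1.values()) / len(f1)                           # equal weight per label
weighted_f1 = sum(f1[l] * support[l] for l in f1) / total       # weight by support

print(round(macro_f1, 2))     # 0.84
print(round(weighted_f1, 2))  # 0.87
```

Note the small `True` class (support 6) barely moves the weighted average but counts fully in the macro average.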
---
layout: model
title: Danish asr_wav2vec2_xls_r_300m_ftspeech TFWav2Vec2ForCTC from saattrupdan
author: John Snow Labs
name: asr_wav2vec2_xls_r_300m_ftspeech
date: 2022-09-25
tags: [wav2vec2, da, audio, open_source, asr]
task: Automatic Speech Recognition
language: da
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_ftspeech` is a Danish model originally trained by saattrupdan.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_300m_ftspeech_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_ftspeech_da_4.2.0_3.0_1664101599815.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_ftspeech_da_4.2.0_3.0_1664101599815.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
speech_to_text = Wav2Vec2ForCTC \
.pretrained("asr_wav2vec2_xls_r_300m_ftspeech", "da")\
.setInputCols("audio_assembler") \
.setOutputCol("text")
pipeline = Pipeline(stages=[
audio_assembler,
speech_to_text,
])
pipelineModel = pipeline.fit(audioDf)
pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
val speechToText = Wav2Vec2ForCTC
.pretrained("asr_wav2vec2_xls_r_300m_ftspeech", "da")
.setInputCols("audio_assembler")
.setOutputCol("text")
val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(audioDf)
val pipelineDF = pipelineModel.transform(audioDf)
```
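The pipelines above expect the `audio_content` column to hold raw audio samples as normalized floats. A minimal stdlib-only sketch of preparing such data from 16-bit PCM WAV audio (the synthetic tone here just stands in for a real recording):

```python
import io
import math
import struct
import wave

# Synthesize 0.1 s of a 440 Hz tone at 16 kHz mono, write it as a 16-bit
# PCM WAV in memory, then read it back as the float list AudioAssembler
# expects in "audio_content". With real data you would open your own file.
rate, secs = 16000, 0.1
samples = [math.sin(2 * math.pi * 440 * t / rate) for t in range(int(rate * secs))]

buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit PCM
    w.setframerate(rate)
    w.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))

buf.seek(0)
with wave.open(buf, "rb") as w:
    raw = w.readframes(w.getnframes())
pcm = struct.unpack("<%dh" % (len(raw) // 2), raw)
audio_content = [s / 32768.0 for s in pcm]   # floats in [-1, 1)
print(len(audio_content))  # 1600
```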
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_xls_r_300m_ftspeech|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|da|
|Size:|1.2 GB|
---
layout: model
title: English RobertaForQuestionAnswering Large Cased model (from tli8hf)
author: John Snow Labs
name: roberta_qa_unqover_large_news
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unqover-roberta-large-newsqa` is an English model originally trained by `tli8hf`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_large_news_en_4.3.0_3.0_1674224676431.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_large_news_en_4.3.0_3.0_1674224676431.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_unqover_large_news","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_unqover_large_news","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
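Under the hood, extractive QA models like this one score each context token as a possible answer start and end, and the answer is the span with the best combined score. A toy illustration with made-up scores (Spark NLP handles this internally):

```python
# Pick the (start, end) token pair with the highest combined score,
# subject to start <= end. Scores are invented for illustration.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.0, 0.1, 0.2, 5.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0]
end_scores   = [0.0, 0.0, 0.1, 4.5, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0]

best = max(
    ((s, e) for s in range(len(tokens)) for e in range(s, len(tokens))),
    key=lambda p: start_scores[p[0]] + end_scores[p[1]],
)
answer = " ".join(tokens[best[0] : best[1] + 1])
print(answer)  # Clara
```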
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_unqover_large_news|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/tli8hf/unqover-roberta-large-newsqa
---
layout: model
title: Financial Legal proceedings Item Binary Classifier
author: John Snow Labs
name: finclf_legal_proceedings_item
date: 2022-08-10
tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This model is a Binary Classifier (True, False) for the `legal_proceedings` item type of 10K Annual Reports. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level.
If you have big financial documents, and you want to look for clauses, we recommend you to split the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting your text into smaller pieces (you can also check the same tutorial linked above).
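A minimal sketch of the splitting strategy described above: break a filing into paragraphs on blank lines, then guard against the 512-token limit with a crude whitespace token count. A real tokenizer (wordpiece) counts tokens differently, so treat the limit here as an approximation:

```python
import re

MAX_TOKENS = 512  # approximate embedding limit noted above

def split_document(text, max_tokens=MAX_TOKENS):
    """Split on blank lines (paragraphs), then window long paragraphs."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    for p in paragraphs:
        words = p.split()
        for i in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[i : i + max_tokens]))
    return chunks

# Hypothetical input: a short header paragraph plus one very long paragraph.
doc = "Item 3. Legal Proceedings.\n\n" + "word " * 1000
chunks = split_document(doc)
print([len(c.split()) for c in chunks])  # [4, 512, 488]
```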
## Predicted Entities
`other`, `legal_proceedings`
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_legal_proceedings_item_en_1.0.0_3.2_1660154442715.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_legal_proceedings_item_en_1.0.0_3.2_1660154442715.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
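The usage snippet is missing from this card; below is a minimal sketch following the pattern of other Finance NLP classifier cards. The sentence-embeddings stage is an assumption (this card only states that the classifier consumes `sentence_embeddings`); check the Models Hub for the exact stack.

```python
# A minimal sketch, assuming the usual Finance NLP classifier stack.
# The embeddings model name below is an assumption, not confirmed by this card.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

classifier = ClassifierDLModel.pretrained("finclf_legal_proceedings_item", "en", "finance/models") \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("category")

pipeline = Pipeline(stages=[documentAssembler, embeddings, classifier])

data = spark.createDataFrame([["The Company is subject to various legal proceedings arising in the ordinary course of business."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```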
## Results
```bash
+-------------------+
|             result|
+-------------------+
|[legal_proceedings]|
|            [other]|
|            [other]|
|[legal_proceedings]|
+-------------------+
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|finclf_legal_proceedings_item|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.6 MB|
## References
Weak labelling on documents from the Edgar database.
## Benchmarking
```bash
label precision recall f1-score support
legal_proceedings 0.96 0.88 0.92 25
other 0.92 0.97 0.95 36
accuracy - - 0.93 61
macro-avg 0.94 0.93 0.93 61
weighted-avg 0.94 0.93 0.93 61
```
---
layout: model
title: English T5ForConditionalGeneration Cased model (from lordtt13)
author: John Snow Labs
name: t5_inshorts
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-inshorts` is an English model originally trained by `lordtt13`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_inshorts_en_4.3.0_3.0_1675124897561.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_inshorts_en_4.3.0_3.0_1675124897561.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_inshorts","en") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_inshorts","en")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_inshorts|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|927.0 MB|
## References
- https://huggingface.co/lordtt13/t5-inshorts
- https://arxiv.org/abs/1910.10683
- https://www.kaggle.com/shashichander009/inshorts-news-data
- https://github.com/lordtt13/transformers-experiments/blob/master/Custom%20Tasks/fine-tune-t5-for-summarization.ipynb
- https://github.com/lordtt13
- https://www.linkedin.com/in/tanmay-thakur-6bb5a9154/
---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab240 TFWav2Vec2ForCTC from hassnain
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab240
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab240` is an English model originally trained by hassnain.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab240_gpu.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab240_en_4.2.0_3.0_1664023948102.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab240_en_4.2.0_3.0_1664023948102.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab240', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab240", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab240|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|
## Included Models
- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Russian T5ForConditionalGeneration Small Cased model (from cointegrated)
author: John Snow Labs
name: t5_rut5_small_normalizer
date: 2023-01-30
tags: [ru, open_source, t5, tensorflow]
task: Text Generation
language: ru
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rut5-small-normalizer` is a Russian model originally trained by `cointegrated`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_rut5_small_normalizer_ru_4.3.0_3.0_1675106835622.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_rut5_small_normalizer_ru_4.3.0_3.0_1675106835622.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
t5 = T5Transformer.pretrained("t5_rut5_small_normalizer","ru") \
.setInputCols("document") \
.setOutputCol("answers")
pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val t5 = T5Transformer.pretrained("t5_rut5_small_normalizer","ru")
.setInputCols("document")
.setOutputCol("answers")
val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|t5_rut5_small_normalizer|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|ru|
|Size:|277.8 MB|
## References
- https://huggingface.co/cointegrated/rut5-small-normalizer
- https://github.com/natasha/natasha
- https://github.com/kmike/pymorphy2
- https://wortschatz.uni-leipzig.de/en/download/Russian
---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from jgammack)
author: John Snow Labs
name: roberta_qa_jgammack_base_squad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad` is an English model originally trained by `jgammack`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_jgammack_base_squad_en_4.3.0_3.0_1674218670079.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_jgammack_base_squad_en_4.3.0_3.0_1674218670079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])
Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_jgammack_base_squad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)
pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_jgammack_base_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))
val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|roberta_qa_jgammack_base_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|464.2 MB|
|Case sensitive:|true|
|Max sentence length:|256|
## References
- https://huggingface.co/jgammack/roberta-base-squad
---
layout: model
title: Pipeline to Detect Drug Information
author: John Snow Labs
name: ner_posology_biobert_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, drug, en]
task: [Named Entity Recognition, Pipeline Healthcare]
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_posology_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_posology_biobert_en.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange.button-orange-trans.arr.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_POSOLOGY.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_biobert_pipeline_en_3.4.1_3.0_1647871826696.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_biobert_pipeline_en_3.4.1_3.0_1647871826696.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_posology_biobert_pipeline", "en", "clinical/models")
pipeline.fullAnnotate('The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.')
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_posology_biobert_pipeline", "en", "clinical/models")
pipeline.fullAnnotate("The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.")
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.posology_biobert.pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""")
```
## Results
```bash
+--------------+---------+
|chunks |entities |
+--------------+---------+
|insulin |DRUG |
|Bactrim |DRUG |
|for 14 days |DURATION |
|Fragmin |DRUG |
|5000 units |DOSAGE |
|subcutaneously|ROUTE |
|daily |FREQUENCY|
|Xenaderm |DRUG |
|topically |ROUTE |
|b.i.d |FREQUENCY|
|Lantus |DRUG |
|40 units |DOSAGE |
|subcutaneously|ROUTE |
|at bedtime |FREQUENCY|
|OxyContin |DRUG |
|30 mg |STRENGTH |
|p.o |ROUTE |
|q.12 h |FREQUENCY|
|folic acid |DRUG |
|1 mg |STRENGTH |
+--------------+---------+
only showing top 20 rows
```
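Downstream, the chunk/entity rows above are typically regrouped into per-drug records; a minimal post-processing sketch (plain Python, not part of the pretrained pipeline), using a few rows transcribed from the table:

```python
# Group sequential (chunk, entity) rows into per-drug records: each DRUG
# starts a new record and subsequent attribute entities attach to it.
rows = [
    ("Fragmin", "DRUG"), ("5000 units", "DOSAGE"),
    ("subcutaneously", "ROUTE"), ("daily", "FREQUENCY"),
    ("Xenaderm", "DRUG"), ("topically", "ROUTE"), ("b.i.d", "FREQUENCY"),
]

def group_by_drug(rows):
    records, current = [], None
    for chunk, entity in rows:
        if entity == "DRUG":
            current = {"drug": chunk}
            records.append(current)
        elif current is not None:
            current[entity.lower()] = chunk
    return records

print(group_by_drug(rows))
```

Note this simple grouping assumes attributes always follow their drug within the row order, which holds for the sample output above but not for every clinical sentence.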
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|ner_posology_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.0 MB|
## Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverter
---
layout: model
title: Pipeline to Detect Clinical Entities (ner_jsl)
author: John Snow Labs
name: ner_jsl_pipeline
date: 2023-03-09
tags: [ner, licensed, en, clinical]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pretrained pipeline is built on top of the [ner_jsl](https://nlp.johnsnowlabs.com/2022/10/19/ner_jsl_en.html) model.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_pipeline_en_4.3.0_3.2_1678353833465.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_pipeline_en_4.3.0_3.2_1678353833465.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline("ner_jsl_pipeline", "en", "clinical/models")
text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.'''
result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("ner_jsl_pipeline", "en", "clinical/models")
val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."
val result = pipeline.fullAnnotate(text)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.jsl.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = sparknlp.annotators.Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"]) \
.setOutputCol("embeddings")
ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_converter = NerConverter() \
.setInputCols(["sentences", "tokens", "ner_tags"]) \
.setOutputCol("ner_chunks")
dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \
.setInputCols(["sentences", "pos_tags", "tokens"]) \
.setOutputCol("dependencies")
# Set a filter on pairs of named entities which will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter() \
.setInputCols(["ner_chunks", "dependencies"])\
.setMaxSyntacticDistance(10)\
.setOutputCol("re_ner_chunks")\
.setRelationPairs(['direction-external_body_part_or_region',
'external_body_part_or_region-direction',
'direction-internal_organ_or_component',
'internal_organ_or_component-direction'
])
# The dataset this model was trained on is split sentence-wise.
# This model can also be trained on document-level relations; in that case, use "document" instead of "sentence" as input at prediction time.
re_model = RelationExtractionDLModel()\
.pretrained('redl_bodypart_direction_biobert', 'en', "clinical/models") \
.setPredictionThreshold(0.5)\
.setInputCols(["re_ner_chunks", "sentences"]) \
.setOutputCol("relations")
pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model])
data = spark.createDataFrame([[''' MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia ''']]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
...
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentences"))
.setOutputCol("tokens")
val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val ner_converter = new NerConverter()
.setInputCols(Array("sentences", "tokens", "ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
// Set a filter on pairs of named entities which will be treated as relation candidates
val re_ner_chunk_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunks", "dependencies"))
.setMaxSyntacticDistance(10)
.setOutputCol("re_ner_chunks")
.setRelationPairs(Array("direction-external_body_part_or_region",
"external_body_part_or_region-direction",
"direction-internal_organ_or_component",
"internal_organ_or_component-direction"))
// The dataset this model was trained on is split sentence-wise.
// This model can also be trained on document-level relations; in that case, use "document" instead of "sentence" as input at prediction time.
val re_model = RelationExtractionDLModel.pretrained("redl_bodypart_direction_biobert", "en", "clinical/models")
.setPredictionThreshold(0.5)
.setInputCols(Array("re_ner_chunks", "sentences"))
.setOutputCol("relations")
val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model))
val data = Seq("MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.relation").predict("""MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia""")
```
## Results
```bash
| index | relations | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|-------|-----------|-----------------------------|---------------|-------------|------------|-----------------------------|---------------|-------------|---------------|------------|
| 0 | 1 | Direction | 35 | 39 | upper | Internal_organ_or_component | 41 | 50 | brain stem | 0.9999989 |
| 1 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 59 | 68 | cerebellum | 0.99992585 |
| 2 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.9999999 |
| 3 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 54 | 57 | left | 0.999811 |
| 4 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 75 | 79 | right | 0.9998203 |
| 5 | 1 | Direction | 54 | 57 | left | Internal_organ_or_component | 59 | 68 | cerebellum | 1.0 |
| 6 | 0 | Direction | 54 | 57 | left | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.97616416 |
| 7 | 0 | Internal_organ_or_component | 59 | 68 | cerebellum | Direction | 75 | 79 | right | 0.953046 |
| 8 | 1 | Direction | 75 | 79 | right | Internal_organ_or_component | 81 | 93 | basil ganglia | 1.0 |
```
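Reading the table above: rows where `relations` is 1 are the pairs the model links. Filtering them can be sketched in plain Python (the tuples below are transcribed from the table; this is post-processing, not part of the pipeline):

```python
# (relations flag, chunk1, chunk2) transcribed from the results table above.
rows = [
    (1, "upper", "brain stem"),
    (0, "upper", "cerebellum"),
    (0, "upper", "basil ganglia"),
    (0, "brain stem", "left"),
    (0, "brain stem", "right"),
    (1, "left", "cerebellum"),
    (0, "left", "basil ganglia"),
    (0, "cerebellum", "right"),
    (1, "right", "basil ganglia"),
]

# Keep only the predicted positive direction / body-part pairs.
linked = [(a, b) for flag, a, b in rows if flag == 1]
print(linked)  # [('upper', 'brain stem'), ('left', 'cerebellum'), ('right', 'basil ganglia')]
```

The three surviving pairs match the clinical reading of the input sentence: upper brain stem, left cerebellum, right basil ganglia.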
{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|redl_bodypart_direction_biobert|
|Compatibility:|Healthcare NLP 3.0.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Case sensitive:|true|
## Data Source
Trained on an internal dataset.
## Benchmarking
```bash
Relation Recall Precision F1 Support
0 0.856 0.873 0.865 153
1 0.986 0.984 0.985 1347
Avg. 0.921 0.929 0.925 -
```
---
layout: model
title: Clinical Deidentification
author: John Snow Labs
name: clinical_deidentification
date: 2022-03-03
tags: [deidentification, en, licensed, pipeline, clinical]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 2.4
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---
## Description
This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR`, and `EMAIL` entities.
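The difference between masking (replace PHI with its entity label) and obfuscation (replace PHI with a surrogate of the same type) can be illustrated with a toy, regex-based sketch; this is NOT the pipeline's actual implementation, which relies on NER models and context-aware rules:

```python
import re

# Toy illustration of masking vs. obfuscation; surrogate values are made up.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\(\d{3}\) \d{3}-\d{4}"),
    "EMAIL": re.compile(r"\b[\w.]+@[\w.]+\.\w+\b"),
}
FAKE_VALUES = {"SSN": "999-99-9999", "PHONE": "(555) 000-0000",
               "EMAIL": "jane.doe@example.com"}

def mask(text: str) -> str:
    """Replace each PHI match with its entity label."""
    for label, pat in PATTERNS.items():
        text = pat.sub(f"<{label}>", text)
    return text

def obfuscate(text: str) -> str:
    """Replace each PHI match with a surrogate value of the same type."""
    for label, pat in PATTERNS.items():
        text = pat.sub(FAKE_VALUES[label], text)
    return text

sample = "SSN 333-44-6666, Phone (302) 786-5227, E-MAIL: smith@gmail.com."
print(mask(sample))
print(obfuscate(sample))
```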
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_en_3.4.1_2.4_1646340071616.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_en_3.4.1_2.4_1646340071616.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models")
sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13.
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93.
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""
result = deid_pipeline.annotate(sample)
print("\n".join(result['masked']))
print("\n".join(result['masked_with_chars']))
print("\n".join(result['masked_fixed_length_chars']))
print("\n".join(result['obfuscated']))
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val deid_pipeline = new PretrainedPipeline("clinical_deidentification","en","clinical/models")
val sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13.
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93.
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""
val result = deid_pipeline.annotate(sample)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.de_identify.clinical_pipeline").predict("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13.
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93.
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""")
```
## Results
```bash
Masked with entity labels
------------------------------
Name : <PATIENT>, Record date: <DATE>, # <MEDICALRECORD>.
Dr. <DOCTOR>, ID<IDNUM>, IP <IPADDR>.
He is a